arxiv: 2604.18842 · v1 · submitted 2026-04-20 · 💻 cs.CV

Recognition: unknown

Multi-Domain Learning with Global Expert Mapping

Pourya Shamsolmoali , Masoumeh Zareapoor , Huiyu Zhou , Oscar Mendez , Dacheng Tao , Xuelong Li

Authors on Pith no claims yet

Pith reviewed 2026-05-10 04:41 UTC · model grok-4.3

classification 💻 cs.CV

keywords multi-domain learningmixture of expertsglobal expert mappinglinear programmingdomain adaptationfew-shot adaptationvision modelsexpert specialization

0 comments

The pith

GEM replaces learned routers in mixture-of-experts models with a global linear-programming planner and rounding compiler to assign datasets to specialized experts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard mixture-of-experts models enforce uniform load across experts through balancing losses, which prevents them from specializing on the distinct distributions and label semantics of multiple datasets. GEM introduces a planner-compiler structure that first solves a linear-programming relaxation to produce fractional dataset-to-expert assignments and then applies hierarchical rounding to obtain a fixed, capacity-aware mapping. This removes the balancing loss entirely and yields deterministic, interpretable routing. On the UODB benchmark the resulting GEM-DINO model reaches state-of-the-art accuracy, with the largest gains on underrepresented domains, and reduces task interference when adapting to new tasks with few shots.

Core claim

The central claim is that a global scheduler built from linear-programming relaxation plus hierarchical rounding can compute an effective, capacity-constrained assignment of datasets to experts. Because the assignment is planned globally rather than learned locally, the fairness constraint that forces uniform expert usage disappears, allowing each expert to specialize on coherent subsets of domains without redundancy or conflict.

What carries the argument

Global planner that relaxes dataset-to-expert assignment into a linear program and a compiler that converts the fractional solution into a deterministic mapping via hierarchical rounding.

If this is right

Experts receive coherent subsets of data and therefore learn less redundant representations.
Routing decisions become fixed and directly readable from the plan rather than opaque learned weights.
Task interference drops during few-shot adaptation because each expert already handles a stable domain group.
Performance gains concentrate on underrepresented datasets without sacrificing overall accuracy.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same planner-compiler pattern could be applied to continual learning or multi-task settings where expert specialization must be enforced without learned routers.
If the linear program can be solved at scale, the approach may extend to hundreds of domains where learned routers currently collapse.
The explicit mapping offers a diagnostic tool: domain-specific failures can be traced to the assigned expert rather than to routing noise.

Load-bearing premise

The linear-programming solution followed by hierarchical rounding will produce assignments that respect capacity limits while still allowing meaningful specialization across domains.

What would settle it

If GEM-DINO is evaluated on a held-out multi-domain benchmark and fails to outperform standard MoE models with learned routers on average accuracy or on the rarest domains, the central claim is falsified.

Figures

Figures reproduced from arXiv: 2604.18842 by Dacheng Tao, Huiyu Zhou, Masoumeh Zareapoor, Oscar Mendez, Pourya Shamsolmoali, Xuelong Li.

**Figure 1.** Figure 1: Improved detection quality and convergence with GEM-DINO. The training loss shows that our GEM-DINO (red) converges [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗

**Figure 2.** Figure 2: MoE routing vs. our proposed GEM. Traditional MoEs use [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 3.** Figure 3: Overview of GEM routing. Given datasets D, experts E, and [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 4.** Figure 4: Dataset-Expert mapping under GEM for 12 datasets (IDs [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗

**Figure 5.** Figure 5: Visualization of expert routing and its effect on performance. (Top) Heatmaps showing dataset-to-expert routing at decoder [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗

**Figure 6.** Figure 6: Expert specialization analysis of our GEM-DINO. For each of the 12 datasets (UODB+LVIS), we evaluate detection performance [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗

**Figure 7.** Figure 7: Visualizing expert specialization. Each row shows an [PITH_FULL_IMAGE:figures/full_fig_p009_7.png] view at source ↗

**Figure 8.** Figure 8: Per-Expert performance comparison. While our main results (Table 2) show a superior average mAP, this plots provide a more [PITH_FULL_IMAGE:figures/full_fig_p010_8.png] view at source ↗

**Figure 9.** Figure 9: Visualization of some detections using our proposed global expert matching. [PITH_FULL_IMAGE:figures/full_fig_p011_9.png] view at source ↗

**Figure 10.** Figure 10: Performance of diverse datasets as experts vary. [PITH_FULL_IMAGE:figures/full_fig_p011_10.png] view at source ↗

**Figure 11.** Figure 11: The trade-off between new-task performance (WiderFace [PITH_FULL_IMAGE:figures/full_fig_p011_11.png] view at source ↗

**Figure 12.** Figure 12: Class-level performance on COCO, O365, and VG. We [PITH_FULL_IMAGE:figures/full_fig_p012_12.png] view at source ↗

read the original abstract

Human perception generalizes well across different domains, but most vision models struggle beyond their training data. This gap motivates multi-dataset learning, where a single model is trained on diverse datasets to improve robustness under domain shifts. However, unified training remains challenging due to inconsistencies in data distributions and label semantics. Mixture-of-Experts (MoE) models provide a scalable solution by routing inputs to specialized subnetworks (experts). Yet, existing MoEs often fail to specialize effectively, as their load-balancing mechanisms enforce uniform input distribution across experts. This fairness conflicts with domain-aware routing, causing experts to learn redundant representations, and reducing performance especially on rare or out-of-distribution domains. We propose GEM (Global Expert Mapping), a planner-compiler framework that replaces the learned router with a global scheduler. Our planner, based on linear programming relaxation, computes a fractional assignment of datasets to experts, while the compiler applies hierarchical rounding to convert this soft plan into a deterministic, capacity-aware mapping. Unlike prior MoEs, GEM avoids balancing loss, resolves the conflict between fairness and specialization, and produces interpretable routing. Experiments show that GEM-DINO achieves state-of-the-art performance on the UODB benchmark, with notable gains on underrepresented datasets and solves task interference in few-shot adaptation scenarios.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

GEM swaps the learned router for a global LP planner plus hierarchical rounding to fix expert assignments without balancing losses, directly targeting specialization failures on rare domains.

read the letter

GEM replaces the usual learned router in mixture-of-experts models with a global planner that uses linear programming relaxation to assign datasets to experts, followed by hierarchical rounding to make the mapping deterministic and capacity-aware. This setup skips the load-balancing loss entirely, which the paper argues is the source of poor specialization on rare domains in multi-dataset vision training. The planner-compiler structure is the main novelty. The LP computes fractional dataset-to-expert assignments upfront, and the rounding step produces a fixed mapping that respects capacities without any gating network or auxiliary loss. This is a clear departure from standard MoE routers that try to balance load while hoping for specialization. The method does well at making the routing domain-aware and interpretable by design. It tackles the fairness-specialization conflict head-on, which is a practical bottleneck when training on inconsistent datasets with varying label semantics and sizes. The claims of SOTA on UODB plus gains on underrepresented data and reduced task interference in few-shot cases follow logically if the assignments actually dedicate experts to rare domains. The soft spots center on evidence and practicality. The abstract states strong results but gives no numbers, baselines, or ablation protocol, so the full paper needs to demonstrate that the planner itself drives the gains rather than other factors like architecture tweaks. Scalability of the LP solve for larger dataset counts or expert pools is also unclear and could limit adoption if it requires heavy approximations. This paper is for vision researchers working on multi-domain training or MoE extensions who want an optimization-based alternative to learned routers. Readers focused on domain shift and robust deployment would find the assignment strategy useful to test. It deserves a serious referee because the core mechanism is distinct, reproducible in principle, and addresses a documented limitation in existing MoE setups. I recommend sending it for peer review with requests for detailed experimental comparisons and any notes on solver runtime.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes GEM (Global Expert Mapping), a planner-compiler framework for multi-domain learning with Mixture-of-Experts models. It replaces the learned router with a linear-programming relaxation that computes fractional dataset-to-expert assignments, followed by hierarchical rounding to produce a deterministic, capacity-aware mapping. This is claimed to eliminate the conflict between load-balancing fairness and domain specialization that arises in standard MoE training, yielding interpretable routing without a balancing loss. Experiments are reported to show that GEM-DINO attains state-of-the-art performance on the UODB benchmark, with particular gains on underrepresented datasets and improved handling of task interference in few-shot adaptation scenarios.

Significance. If the quantitative claims hold, the work would be a meaningful contribution to multi-domain and multi-task vision learning. The global LP-based planner offers a deterministic, interpretable alternative to learned routers and balancing losses, directly addressing a known tension in MoE specialization. Credit is due for the clean separation of planning from compilation and for the focus on rare-domain performance, which is a persistent practical challenge.

major comments (2)

[Abstract and §5] Abstract and §5 (Experiments): the central claim of SOTA performance on UODB with gains on underrepresented datasets is stated without any numerical results, baseline comparisons, ablation tables, or protocol details. This absence is load-bearing because the soundness of the LP-plus-rounding mechanism cannot be assessed without evidence that the planner actually produces the claimed specialization.
[§3.2] §3.2 (Planner and Compiler): the hierarchical rounding step is described at a high level but lacks a formal statement of the capacity constraints or a proof sketch that the rounding preserves feasibility while avoiding the fairness-specialization conflict. Without this, it is unclear whether the deterministic mapping introduces new performance bottlenecks on rare domains.

minor comments (2)

[§3.1] Notation for the LP variables (e.g., fractional assignment matrix) would benefit from an explicit definition table to improve readability.
[Figure 4] Figure captions for the routing visualizations should include quantitative metrics (e.g., expert utilization per domain) rather than qualitative descriptions alone.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and insightful comments, which highlight important aspects of clarity and rigor in presenting our GEM framework. We appreciate the positive assessment of the overall approach and its potential contribution to multi-domain MoE learning. We address each major comment below and commit to revisions that strengthen the manuscript without altering its core claims.

read point-by-point responses

Referee: [Abstract and §5] Abstract and §5 (Experiments): the central claim of SOTA performance on UODB with gains on underrepresented datasets is stated without any numerical results, baseline comparisons, ablation tables, or protocol details. This absence is load-bearing because the soundness of the LP-plus-rounding mechanism cannot be assessed without evidence that the planner actually produces the claimed specialization.

Authors: We agree that the absence of explicit numerical results in the abstract and a more detailed presentation in §5 makes it difficult for readers to immediately assess the claims. In the revised manuscript we will insert key quantitative results (overall UODB accuracy, per-domain gains on underrepresented sets, and comparisons to strong baselines) directly into the abstract. Section 5 will be expanded with full baseline tables, ablation studies isolating the planner and rounding steps, and complete experimental protocols so that the specialization behavior of the LP-derived mapping can be evaluated directly from the evidence. revision: yes
Referee: [§3.2] §3.2 (Planner and Compiler): the hierarchical rounding step is described at a high level but lacks a formal statement of the capacity constraints or a proof sketch that the rounding preserves feasibility while avoiding the fairness-specialization conflict. Without this, it is unclear whether the deterministic mapping introduces new performance bottlenecks on rare domains.

Authors: We acknowledge that the current description of hierarchical rounding is informal. In the revision we will supply a precise mathematical formulation of the capacity constraints enforced during rounding and a concise proof sketch establishing that the procedure yields a feasible integer assignment while respecting the fractional solution’s domain-expert preferences. We will also add a short discussion, supported by additional analysis, showing that the deterministic mapping does not create new bottlenecks on rare domains; our existing experiments already indicate stable performance on low-resource sets, and we will make this explicit. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

full rationale

The paper introduces GEM as an independent algorithmic replacement for learned routers in MoE models: a linear-programming relaxation computes fractional dataset-to-expert assignments, followed by hierarchical rounding to obtain a deterministic capacity-aware mapping. This construction is presented directly from first principles without reducing to fitted parameters renamed as predictions, without self-citation load-bearing for the core uniqueness claim, and without any ansatz or self-definition that equates the output to the input by construction. The claimed resolution of fairness-specialization conflict and the UODB experimental gains are asserted to follow from the explicit planner-compiler steps rather than from any re-expression of prior fitted values. No load-bearing equation or premise collapses to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim depends on the effectiveness of the LP-based global assignment and rounding procedure; these are introduced without external benchmarks or proofs in the abstract.

axioms (1)

domain assumption Linear programming relaxation followed by hierarchical rounding yields a capacity-aware deterministic mapping that improves specialization over learned routers
This is the core premise of the planner-compiler framework stated in the abstract.

invented entities (1)

GEM (Global Expert Mapping) planner-compiler framework no independent evidence
purpose: To compute and enforce dataset-to-expert assignments without learned routing or balancing losses
New framework proposed to resolve the fairness-specialization conflict in MoE models

pith-pipeline@v0.9.0 · 5538 in / 1286 out tokens · 41688 ms · 2026-05-10T04:41:02.798228+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

65 extracted references · 2 canonical work pages · 1 internal anchor

[1]

Zero-shot sparse mixture of low-rank experts construction from pre- trained foundation models,

A. Tang, L. Shen, Y . Luo, S. Xie, H. Hu, L. Zhang, B. Du, and D. Tao, “Zero-shot sparse mixture of low-rank experts construction from pre- trained foundation models,”IEEE Trans. Pattern Anal. Mach. Intell., 2025

2025
[2]

Large-scale object detection in the wild with imbalanced data distribution, and multi-labels,

C. Pan, J. Peng, X. Bu, and Z. Zhang, “Large-scale object detection in the wild with imbalanced data distribution, and multi-labels,”IEEE Trans. Pattern Anal. Mach. Intell., vol. 46, no. 12, pp. 9255–9271, 2024

2024
[3]

Learning heterogeneous mixture of scene experts for large-scale neural radiance fields,

Z. Mi, P. Yin, X. Xiao, and D. Xu, “Learning heterogeneous mixture of scene experts for large-scale neural radiance fields,”IEEE Trans. Pattern Anal. Mach. Intell., 2025

2025
[4]

Sparse mixture-of-experts are domain generalizable learners,

B. Li, Y . Shen, J. Yang, Y . Wang, J. Ren, T. Che, J. Zhang, and Z. Liu, “Sparse mixture-of-experts are domain generalizable learners,”Int. Conf. Learn. Represent., 2023

2023
[5]

Setformer is what you need for vision and language,

P. Shamsolmoali, M. Zareapoor, E. Granger, and M. Felsberg, “Setformer is what you need for vision and language,” inProc. AAAI Conference on Artificial Intelligence, vol. 38, no. 5, 2024, pp. 4713–4721. 13

2024
[6]

Learning general and specific embedding with transformer for few-shot object detection,

X. Zhang, Z. Chen, J. Zhang, T. Liu, and D. Tao, “Learning general and specific embedding with transformer for few-shot object detection,”Int. J. Comput. Vis., vol. 133, no. 2, pp. 968–984, 2025

2025
[7]

Microsoft coco: Common objects in context,

T.-Y . Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Doll ´ar, and C. L. Zitnick, “Microsoft coco: Common objects in context,” inEur. Conf. Comput. Vis., 2014, pp. 740–755

2014
[8]

Cross-domain weakly-supervised object detection through progressive domain adapta- tion,

N. Inoue, R. Furuta, T. Yamasaki, and K. Aizawa, “Cross-domain weakly-supervised object detection through progressive domain adapta- tion,” inIEEE Conf. Comput. Vis. Pattern Recog., 2018, pp. 5001–5009

2018
[9]

Efficient parametrization of multi-domain deep neural networks,

S.-A. Rebuffi, H. Bilen, and A. Vedaldi, “Efficient parametrization of multi-domain deep neural networks,” inIEEE Conf. Comput. Vis. Pattern Recog., 2018, pp. 8119–8127

2018
[10]

Damex: Dataset-aware mixture- of-experts for visual understanding of mixture-of-datasets,

Y . Jain, H. Behl, Z. Kira, and V . Vineet, “Damex: Dataset-aware mixture- of-experts for visual understanding of mixture-of-datasets,”Adv. Neural Inform. Process. Syst., vol. 36, 2024

2024
[11]

Simple multi-dataset detection,

X. Zhou, V . Koltun, and P. Kr¨ahenb¨uhl, “Simple multi-dataset detection,” inIEEE Conf. Comput. Vis. Pattern Recog., 2022, pp. 7571–7580

2022
[12]

Towards universal object detection by domain attention,

X. Wang, Z. Cai, D. Gao, and N. Vasconcelos, “Towards universal object detection by domain attention,” inIEEE Conf. Comput. Vis. Pattern Recog., 2019, pp. 7289–7298

2019
[13]

Detection hub: Unifying object detection datasets via query adaptation on language embedding,

L. Meng, X. Dai, Y . Chen, P. Zhang, D. Chen, M. Liu, J. Wang, Z. Wu, L. Yuan, and Y .-G. Jiang, “Detection hub: Unifying object detection datasets via query adaptation on language embedding,” inIEEE Conf. Comput. Vis. Pattern Recog., 2023, pp. 11 402–11 411

2023
[14]

Multi-dataset, multitask learning of egocentric vision tasks,

G. Kapidis, R. Poppe, and R. C. Veltkamp, “Multi-dataset, multitask learning of egocentric vision tasks,”IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 6, pp. 6618–6630, 2022

2022
[15]

Plain-det: A plain multi-dataset object detector,

C. Shi, Y . Zhu, and S. Yang, “Plain-det: A plain multi-dataset object detector,”Eur. Conf. Comput. Vis., 2024

2024
[16]

Outrageously large neural networks: The sparsely-gated mixture-of-experts layer,

N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, and J. Dean, “Outrageously large neural networks: The sparsely-gated mixture-of-experts layer,”Int. Conf. Learn. Represent., 2017

2017
[17]

Remoe: Fully differentiable mixture-of- experts with relu routing,

Z. Wang, J. Zhu, and J. Chen, “Remoe: Fully differentiable mixture-of- experts with relu routing,”Int. Conf. Learn. Represent., 2025

2025
[18]

Mergeme: Model merging techniques for homogeneous and heterogeneous moes,

Y . Zhou, G. Karamanolakis, V . Soto, A. Rumshisky, M. Kulkarni, F. Huang, W. Ai, and J. Lu, “Mergeme: Model merging techniques for homogeneous and heterogeneous moes,”Conf. North Amer. Chapter Assoc. Comput. Linguistics, 2025

2025
[19]

Mocae: Mixture of calibrated experts significantly improves object detection,

K. Oksuz, S. Kuzucu, T. Joy, and P. K. Dokania, “Mocae: Mixture of calibrated experts significantly improves object detection,”Trans. Mach. Learn. Res., 2024

2024
[20]

Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity,

W. Fedus, B. Zoph, and N. Shazeer, “Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity,”J. Mach. Learn. Res., vol. 23, no. 120, pp. 1–39, 2022

2022
[21]

Deepseekmoe: Towards ultimate expert specialization in mixture-of-experts language models,

D. Dai, C. Deng, C. Zhao, R. Xu, H. Gao, D. Chen, J. Li, W. Zeng, X. Yu, Y . Wuet al., “Deepseekmoe: Towards ultimate expert specialization in mixture-of-experts language models,”Proc. ACL, 2024

2024
[22]

Multilinear mixture of experts: Scalable expert specialization through factorization,

J. Oldfield, M. Georgopoulos, G. Chrysos, C. Tzelepis, Y . Panagakis, M. Nicolaou, J. Deng, and I. Patras, “Multilinear mixture of experts: Scalable expert specialization through factorization,”Adv. Neural Inform. Process. Syst., vol. 37, pp. 53 022–53 063, 2024

2024
[23]

Load balancing mixture of experts with similarity preserving routers,

N. Omi, S. Sen, and A. Farhadi, “Load balancing mixture of experts with similarity preserving routers,”arXiv preprint arXiv:2506.14038, 2025

work page arXiv 2025
[24]

Uni-moe: Scaling unified multimodal llms with mixture of experts,

Y . Li, S. Jiang, B. Hu, L. Wang, W. Zhong, W. Luo, L. Ma, and M. Zhang, “Uni-moe: Scaling unified multimodal llms with mixture of experts,” IEEE Trans. Pattern Anal. Mach. Intell., 2025

2025
[25]

Buffer overflow in mixture of experts,

J. Hayes, I. Shumailov, and I. Yona, “Buffer overflow in mixture of experts,”Adv. Neural Inform. Process. Syst. Workshop, 2024

2024
[26]

Harder tasks need more experts: Dynamic routing in moe models,

Q. Huang, Z. An, N. Zhuang, M. Tao, C. Zhang, Y . Jin, K. Xu, L. Chen, S. Huang, and Y . Feng, “Harder tasks need more experts: Dynamic routing in moe models,”Proc. Annu. Meet. Assoc. Comput. Linguistics, 2024

2024
[27]

Machine learning in compiler optimization,

Z. Wang and M. O’Boyle, “Machine learning in compiler optimization,” Proc. IEEE, vol. 106, no. 11, pp. 1879–1901, 2018

1901
[28]

Machine-learning-based self-optimizing compiler heuristics,

R. Mosaner, D. Leopoldseder, W. Kisling, L. Stadler, and H. M ¨ossenb¨ock, “Machine-learning-based self-optimizing compiler heuristics,” inProc. Int. Conf. Managed Program. Lang. Runtimes, 2022, pp. 98–111

2022
[29]

Improved deterministic distributed matching via rounding,

M. Fischer, “Improved deterministic distributed matching via rounding,” Distributed Computing, vol. 33, no. 3, pp. 279–291, 2020

2020
[30]

Hierarchical clustering via spreading metrics,

A. Roy and S. Pokutta, “Hierarchical clustering via spreading metrics,” J. Mach. Learn. Res., vol. 18, no. 88, pp. 1–35, 2017

2017
[31]

Dynamic algorithms for packing-covering lps via multiplicative weight updates,

S. Bhattacharya, P. Kiss, and T. Saranurak, “Dynamic algorithms for packing-covering lps via multiplicative weight updates,” inACM-SIAM Symposium on Discrete Algo., 2023, pp. 1–47

2023
[32]

Base layers: Simplifying training of large, sparse models,

M. Lewis, S. Bhosale, T. Dettmers, N. Goyal, and L. Zettlemoyer, “Base layers: Simplifying training of large, sparse models,” inInt. Conf. Mach. Learn., 2021, pp. 6265–6274

2021
[33]

Unified scaling laws for routed language models,

A. Clark, D. de Las Casas, A. Guy, A. Mensch, M. Paganini, J. Hoffmann, B. Damoc, B. Hechtman, T. Cai, S. Borgeaudet al., “Unified scaling laws for routed language models,” inInt. Conf. Mach. Learn., 2022, pp. 4057– 4086

2022
[34]

Sparsity-constrained optimal transport,

T. Liu, J. Puigcerver, and M. Blondel, “Sparsity-constrained optimal transport,”Int. Conf. Learn. Represent., 2022

2022
[35]

Dino: Detr with improved denoising anchor boxes for end-to-end object detection,

H. Zhang, F. Li, S. Liu, L. Zhang, H. Su, J. Zhu, L. M. Ni, and H.-Y . Shum, “Dino: Detr with improved denoising anchor boxes for end-to-end object detection,”Int. Conf. Learn. Represent., 2023

2023
[36]

From sparse to soft mixtures of experts,

J. Puigcerver, C. Riquelme, B. Mustafa, and N. Houlsby, “From sparse to soft mixtures of experts,”Int. Conf. Learn. Represent., 2024

2024
[37]

Moe++: Accelerating mixture- of-experts methods with zero-computation experts,

P. Jin, B. Zhu, L. Yuan, and S. Yan, “Moe++: Accelerating mixture- of-experts methods with zero-computation experts,”Int. Conf. Learn. Represent., 2025

2025
[38]

Scaling vision with sparse mixture of experts,

C. Riquelme, J. Puigcerver, B. Mustafa, M. Neumann, R. Jenatton, A. Susano Pinto, D. Keysers, and N. Houlsby, “Scaling vision with sparse mixture of experts,”Adv. Neural Inform. Process. Syst., vol. 34, pp. 8583– 8595, 2021

2021
[39]

Grounding dino: Marrying dino with grounded pre-training for open-set object detection,

S. Liu, Z. Zeng, T. Ren, F. Li, H. Zhang, J. Yang, Q. Jiang, C. Li, J. Yang, H. Suet al., “Grounding dino: Marrying dino with grounded pre-training for open-set object detection,” inEur. Conf. Comput. Vis., 2024, pp. 38– 55

2024
[40]

Multi-source and multi- target domain adaptation based on dynamic generator with attention,

Y . Lu, H. Huang, B. Zeng, Z. Lai, and X. Li, “Multi-source and multi- target domain adaptation based on dynamic generator with attention,” IEEE Trans. Multimedia, vol. 26, pp. 6891–6905, 2024

2024
[41]

Generalizing to new tasks via one-shot compositional subgoals,

B. Xihan, O. Mendez, Z. Lianpin, and S. Hadfield, “Generalizing to new tasks via one-shot compositional subgoals,” inInt. Conf. Auto. Robot. Appl., 2024, pp. 491–495

2024
[42]

Omnivore: A single model for many visual modalities,

R. Girdhar, M. Singh, N. Ravi, L. Van Der Maaten, A. Joulin, and I. Misra, “Omnivore: A single model for many visual modalities,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 16 102–16 112

2022
[43]

Detecting 11k classes: Large scale object detection without fine-grained bounding boxes,

H. Yang, H. Wu, and H. Chen, “Detecting 11k classes: Large scale object detection without fine-grained bounding boxes,” inInt. Conf. Comput. Vis., 2019, pp. 9805–9813

2019
[44]

Glam: Efficient scaling of language models with mixture-of-experts,

N. Du, Y . Huang, A. M. Dai, S. Tong, D. Lepikhin, Y . Xu, M. Krikun, Y . Zhou, A. W. Yu, O. Firatet al., “Glam: Efficient scaling of language models with mixture-of-experts,” inInt. Conf. Mach. Learn., 2022, pp. 5547–5569

2022
[45]

Mixture-of-experts with expert choice routing,

Y . Zhou, T. Lei, H. Liu, N. Du, Y . Huang, V . Zhao, A. M. Dai, Q. V . Le, J. Laudonet al., “Mixture-of-experts with expert choice routing,”Adv. Neural Inform. Process. Syst., vol. 35, pp. 7103–7114, 2022

2022
[46]

Scaling laws for fine-grained mixture of experts,

J. Krajewski, J. Ludziejewski, K. Adamczewski, M. Pi ´oro, M. Krutul, S. Antoniak, K. Ciebiera, K. Kr ´ol, T. Odrzyg ´o´zd´z, P. Sankowskiet al., “Scaling laws for fine-grained mixture of experts,”Int. Conf. Mach. Learn., 2024

2024
[47]

Mixtral of Experts

A. Q. Jiang, A. Sablayrolles, A. Roux, A. Mensch, B. Savary, C. Bam- ford, D. S. Chaplot, D. d. l. Casas, E. B. Hanna, F. Bressandet al., “Mixtral of experts,”arXiv preprint arXiv:2401.04088, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[48]

Continuous action reinforcement learning from a mixture of interpretable experts,

R. Akrour, D. Tateo, and J. Peters, “Continuous action reinforcement learning from a mixture of interpretable experts,”IEEE Trans. Pattern Anal. Mach. Intell., vol. 44, no. 10, pp. 6795–6806, 2022

2022
[49]

On the representation collapse of sparse mixture of experts,

Z. Chi, L. Dong, S. Huang, D. Dai, S. Ma, B. Patra, S. Singhal, P. Bajaj, X. Song, X.-L. Maoet al., “On the representation collapse of sparse mixture of experts,”Adv. Neural Inform. Process. Syst., vol. 35, pp. 34 600–34 613, 2022

2022
[50]

Regularized box- simplex games and dynamic decremental bipartite matching,

A. Jambulapati, Y . Jin, A. Sidford, and K. Tian, “Regularized box- simplex games and dynamic decremental bipartite matching,”Int. Collo- quium on Automata, Lang. Programming, 2022

2022
[51]

An approximation algorithm for the generalized assignment problem,

D. B. Shmoys and ´E. Tardos, “An approximation algorithm for the generalized assignment problem,”Math. Programm., vol. 62, no. 1, pp. 461–474, 1993

1993
[52]

Deterministic decre- mental reachability, scc, and shortest paths via directed expanders and congestion balancing,

A. Bernstein, M. P. Gutenberg, and T. Saranurak, “Deterministic decre- mental reachability, scc, and shortest paths via directed expanders and congestion balancing,” inFoundations of Comp. Sci., 2020, pp. 1123– 1134

2020
[53]

Path finding methods for linear programming: Solving linear programs in o (vrank) iterations and faster algorithms for maximum flow,

Y . T. Lee and A. Sidford, “Path finding methods for linear programming: Solving linear programs in o (vrank) iterations and faster algorithms for maximum flow,” inFoundations of Comp. Sci., 2014, pp. 424–433

2014
[54]

Entropy regularization and faster decremental matching in general graphs,

J. Chen, A. Sidford, and T.-W. Tu, “Entropy regularization and faster decremental matching in general graphs,” inProc. ACM-SIAM Symp. Discrete Algorithms, 2025, pp. 3069–3115. 14

2025
[55]

Lvis: A dataset for large vocabu- lary instance segmentation,

A. Gupta, P. Dollar, and R. Girshick, “Lvis: A dataset for large vocabu- lary instance segmentation,” inIEEE Conf. Comput. Vis. Pattern Recog., 2019, pp. 5356–5364

2019
[56]

The pascal visual object classes challenge: A retrospec- tive,

M. Everingham, S. A. Eslami, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, “The pascal visual object classes challenge: A retrospec- tive,”Int. J. Comput. Vis., vol. 111, pp. 98–136, 2015

2015
[57]

Wider face: A face detection benchmark,

S. Yang, P. Luo, C.-C. Loy, and X. Tang, “Wider face: A face detection benchmark,” inIEEE Conf. Comput. Vis. Pattern Recog., 2016

2016
[58]

Deep lesion graphs in the wild: relationship learning and organization of significant radiology image findings in a diverse large- scale lesion database,

K. Yan, X. Wang, L. Lu, L. Zhang, A. P. Harrison, M. Bagheri, and R. M. Summers, “Deep lesion graphs in the wild: relationship learning and organization of significant radiology image findings in a diverse large- scale lesion database,” inIEEE Conf. Comput. Vis. Pattern Recog., 2018, pp. 9261–9270

2018
[59]

kitchen-object-detection dataset,

kitchenobjectdetection, “kitchen-object-detection dataset,” 2022

2022
[60]

Dota: A large-scale dataset for object detection in aerial images,

G.-S. Xia, X. Bai, J. Ding, Z. Zhu, S. Belongie, J. Luo, M. Datcu, M. Pelillo, and L. Zhang, “Dota: A large-scale dataset for object detection in aerial images,” inIEEE Conf. Comput. Vis. Pattern Recog., 2018, pp. 3974–3983

2018
[61]

Vision-based traffic sign detection and analysis for intelligent driver assistance systems: Perspectives and survey,

A. Mogelmose, M. M. Trivedi, and T. B. Moeslund, “Vision-based traffic sign detection and analysis for intelligent driver assistance systems: Perspectives and survey,”IEEE Trans Intell Transp Syst., vol. 13, no. 4, pp. 1484–1497, 2012

2012
[62]

Are we ready for autonomous driving? the kitti vision benchmark suite,

A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for autonomous driving? the kitti vision benchmark suite,” inIEEE Conf. Comput. Vis. Pattern Recog., 2012, pp. 3354–3361

2012
[63]

k-means++: The advantages of careful seeding,

D. Arthur and S. Vassilvitskii, “k-means++: The advantages of careful seeding,” inProc. ACM-SIAM Symp. Discrete Algorithms, 2007, pp. 1027–1035

2007
[64]

Objects365: A large-scale, high-quality dataset for object detection,

S. Shao, Z. Li, T. Zhang, C. Peng, G. Yu, X. Zhang, J. Li, and J. Sun, “Objects365: A large-scale, high-quality dataset for object detection,” in Int. Conf. Comput. Vis., 2019, pp. 8430–8439

2019
[65]

Unbiased scene graph generation from biased training,

K. Tang, Y . Niu, J. Huang, J. Shi, and H. Zhang, “Unbiased scene graph generation from biased training,” inIEEE Conf. Comput. Vis. Pattern Recog., 2020, pp. 3716–3725. Pourya Shamsolmoali(Senior Member, IEEE) received the PhD degree in computer science from Shanghai Jiao Tong University. He has been a visiting researcher with Link ¨oping Uni- versity, IN...

2020