arxiv: 2604.18256 · v1 · submitted 2026-04-20 · 💻 cs.CV · cs.LG

Recognition: unknown

Domain-Specialized Object Detection via Model-Level Mixtures of Experts

Svetlana Pavlitska , Malte St\"uven , Beyza Keskin , J. Marius Z\"ollner

Authors on Pith no claims yet

Pith reviewed 2026-05-10 04:23 UTC · model grok-4.3

classification 💻 cs.CV cs.LG

keywords mixture of expertsobject detectionYOLOgating networkdomain specializationexpert collapseBDD100Kmodel ensembling

0 comments

The pith

Mixtures of experts with domain-specialized detectors outperform standard ensembles in object detection.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that a model-level mixture of experts, where each expert is a full object detector trained on a semantically disjoint data subset, can improve both detection accuracy and interpretability over conventional ensembles. It addresses the challenge of merging dense, structured predictions from multiple detectors by using a learned gating network and various output fusion strategies, plus balancing losses to avoid expert collapse. A sympathetic reader would care because real-world object detection often spans diverse conditions, and this structured specialization offers a more interpretable way to combine models than black-box averaging. Experiments on a large driving dataset back the claim by showing consistent gains alongside visible patterns of expert focus on different domains.

Core claim

We propose an MoE architecture that combines YOLO-based detectors trained on semantically disjoint data subsets, with a learned gating network that dynamically weights expert contributions. We study different strategies for fusing detection outputs and for training the gating mechanism, including balancing losses to prevent expert collapse. Experiments on the BDD100K dataset demonstrate that the proposed MoE consistently outperforms standard ensemble approaches and provides insights into expert specialization across domains.

What carries the argument

A model-level mixture of experts in which each expert is a complete YOLO detector trained on a semantically disjoint subset, fused by a trainable gating network that learns dynamic weights for the dense outputs.

If this is right

The MoE approach yields higher accuracy than conventional ensembling when detectors are trained on domain-disjoint data.
Insights into expert specialization across domains become available through the gating weights and per-expert performance breakdowns.
Balancing losses during gating training prevents expert collapse and preserves overall detection quality.
Model-level MoEs provide a practical alternative to traditional ensembles for tasks with dense, structured outputs like object detection.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same architecture could be tested on other dense-prediction tasks such as semantic segmentation or depth estimation to check whether fusion strategies transfer.
New experts could be added incrementally for novel domains without retraining the full system, enabling efficient lifelong adaptation.
The observed specialization patterns might guide targeted data collection or active learning strategies that focus on underrepresented domains.
In safety-critical settings the gating decisions could offer post-hoc explanations of which domain expert handled a given input.

Load-bearing premise

Training detectors on semantically disjoint subsets will produce meaningfully specialized experts whose outputs the gating network can fuse without collapse or loss of accuracy.

What would settle it

On the BDD100K dataset the mixture-of-experts model shows no mAP gain over standard ensembles or the experts exhibit no distinct specialization patterns in their error profiles.

Figures

Figures reproduced from arXiv: 2604.18256 by Beyza Keskin, J. Marius Z\"ollner, Malte St\"uven, Svetlana Pavlitska.

**Figure 2.** Figure 2: Proposed MoE architecture consisting of two expert [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 4.** Figure 4: Exemplary images with different values of the [PITH_FULL_IMAGE:figures/full_fig_p004_4.png] view at source ↗

**Figure 3.** Figure 3: Distribution of BDD100K data according to the metadata [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗

**Figure 5.** Figure 5: Distribution of expert weights, predicted by a simple gate [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗

**Figure 6.** Figure 6: Disagreement analysis helps trace MoE decisions back [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗

read the original abstract

Mixture-of-Experts (MoE) models provide a structured approach to combining specialized neural networks and offer greater interpretability than conventional ensembles. While MoEs have been successfully applied to image classification and semantic segmentation, their use in object detection remains limited due to challenges in merging dense and structured predictions. In this work, we investigate model-level mixtures of object detectors and analyze their suitability for improving performance and interpretability in object detection. We propose an MoE architecture that combines YOLO-based detectors trained on semantically disjoint data subsets, with a learned gating network that dynamically weights expert contributions. We study different strategies for fusing detection outputs and for training the gating mechanism, including balancing losses to prevent expert collapse. Experiments on the BDD100K dataset demonstrate that the proposed MoE consistently outperforms standard ensemble approaches and provides insights into expert specialization across domains, highlighting model-level MoEs as a viable alternative to traditional ensembling for object detection. Our code is available at https://github.com/KASTEL-MobilityLab/mixtures-of-experts/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 4 minor

Summary. The paper proposes a model-level Mixture of Experts (MoE) architecture for object detection that trains separate YOLO-based detectors on semantically disjoint subsets of the BDD100K dataset and combines their dense outputs via a learned gating network. Multiple fusion strategies are examined, including score-weighted NMS and per-class gating, together with explicit balancing losses in the gating objective to avoid expert collapse. Experiments demonstrate that the resulting MoE consistently outperforms standard ensemble baselines while yielding interpretable specialization patterns across domains.

Significance. If the reported gains hold, the work provides a practical and interpretable alternative to conventional ensembling for multi-domain object detection. By showing that domain-specialized experts can be fused without collapse and that the gating network learns meaningful routing, the manuscript extends MoE techniques from classification/segmentation to structured prediction tasks and supplies concrete ablation evidence on specialization metrics.

minor comments (4)

The abstract states that the MoE 'consistently outperforms standard ensemble approaches' but does not report the magnitude of the mAP gains or the number of runs; adding these quantitative highlights would strengthen the summary.
Section 4.3 (fusion strategies) introduces per-class gating and score-weighted NMS; a short pseudocode block or explicit equation for the final suppression step would improve reproducibility.
Figure 4 (expert activation histograms) would benefit from axis labels that explicitly tie activation frequency to the domain-specific mAP improvements listed in Table 2.
The limitations paragraph in the conclusion mentions increased inference cost but does not quantify the overhead relative to a single YOLO model; a brief runtime table would be useful.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary of our work and the recommendation for minor revision. The referee's description accurately captures the core contributions of our model-level MoE approach for domain-specialized object detection using YOLO experts on BDD100K, including the gating network, fusion strategies, and balancing losses to prevent collapse. As no specific major comments were listed in the report, we have no point-by-point rebuttals to provide. We will address any minor issues identified during the revision process.

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper proposes and empirically evaluates a model-level MoE for object detection by training YOLO experts on semantically disjoint subsets of BDD100K and learning a gating network (with balancing losses) to fuse outputs via strategies such as score-weighted NMS. All central claims rest on external experimental comparisons (mAP gains, expert activation histograms, ablations) rather than any derivation, prediction, or uniqueness theorem that reduces to the paper's own fitted parameters or self-citations. No equations, ansatzes, or load-bearing self-references appear in the provided text; the architecture and training procedure are explicitly described as learned and measured against independent baselines.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only view yields no identifiable free parameters, axioms, or invented entities; the approach rests on standard supervised training of neural networks and gating without new postulates.

pith-pipeline@v0.9.0 · 5492 in / 1096 out tokens · 52225 ms · 2026-05-10T04:23:48.035783+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

38 extracted references · 5 canonical work pages · 4 internal anchors

[1]

Adaptive mixtures of local experts,

R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton, “Adaptive mixtures of local experts,”Neural computation, vol. 3, no. 1, 1991

1991
[2]

Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

W. Fedus, B. Zoph, and N. Shazeer, “Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity,”CoRR, vol. abs/2101.03961, 2021

work page internal anchor Pith review arXiv 2021
[3]

DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models

D. Dai, C. Deng, C. Zhao, R. Xu, H. Gao, D. Chen, J. Li, W. Zeng, X. Yu, Y . Wuet al., “Deepseekmoe: Towards ultimate expert specialization in mixture-of-experts language models,”arXiv preprint arXiv:2401.06066, 2024

work page internal anchor Pith review arXiv 2024
[4]

Scaling vision with sparse mixture of experts,

C. Riquelme, J. Puigcerver, B. Mustafa, M. Neumann, R. Jenatton, A. S. Pinto, D. Keysers, and N. Houlsby, “Scaling vision with sparse mixture of experts,” inAdvances in Neural Information Processing Systems (NIPS), 2021

2021
[5]

Gshard: Scaling giant models with conditional computation and automatic sharding,

H. Lee, Y . Xu, D. Chen, O. Firat, Y . Huang, M. Krikun, N. Shazeer, and Z. Chen, “Gshard: Scaling giant models with conditional computation and automatic sharding,” inInternational Conference on Learning Representations (ICLR), 2021

2021
[6]

Convoluted Mixture of Deep Experts for Robust Semantic Segmentation,

A. Valada, A. Dhall, and W. Burgard, “Convoluted Mixture of Deep Experts for Robust Semantic Segmentation,” inInternational Conference on Intelligent Robots and Systems (IROS) - Workshops, 2016

2016
[7]

Using mixture of expert models to gain insights into semantic segmentation,

S. Pavlitskaya, C. Hubschneider, M. Weber, R. Moritz, F. H ¨uger, P. Schlicht, and J. M. Z ¨ollner, “Using mixture of expert models to gain insights into semantic segmentation,” inConference on Computer Vision and Pattern Recognition (CVPR) - Workshops, 2020

2020
[8]

Network of experts for large- scale image categorization,

K. Ahmed, M. H. Baig, and L. Torresani, “Network of experts for large- scale image categorization,” inEuropean Conference on Computer Vision (ECCV). Springer, 2016

2016
[9]

Choosing smartly: Adaptive multimodal fusion for object detection in changing environments,

O. Mees, A. Eitel, and W. Burgard, “Choosing smartly: Adaptive multimodal fusion for object detection in changing environments,” in International Conference on Intelligent Robots and Systems (IROS), 2016

2016
[10]

You only look once: Unified, real-time object detection,

J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” inConference on Computer Vision and Pattern Recognition (CVPR), 2016

2016
[11]

BDD100K: A diverse driving dataset for heterogeneous multitask learning,

F. Yu, H. Chen, X. Wang, W. Xian, Y . Chen, F. Liu, V . Madhavan, and T. Darrell, “BDD100K: A diverse driving dataset for heterogeneous multitask learning,” inConference on Computer Vision and Pattern Recognition (CVPR), 2020

2020
[12]

Towards A Rigorous Science of Interpretable Machine Learning

F. Doshi-Velez and B. Kim, “Towards a rigorous science of interpretable machine learning,”arXiv preprint arXiv:1702.08608, 2017

work page internal anchor Pith review arXiv 2017
[13]

Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead,

C. Rudin, “Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead,”Nat. Mach. Intell., 2019

2019
[14]

Weighted boxes fusion: Ensembling boxes from different object detection models,

R. Solovyev, W. Wang, and T. Gabruseva, “Weighted boxes fusion: Ensembling boxes from different object detection models,”Image and Vision Computing, 2021

2021
[15]

Cad: Scale invariant framework for real-time object detection,

H. Zhou, Z. Li, C. Ning, and J. Tang, “Cad: Scale invariant framework for real-time object detection,” inInternational Conference on Computer Vision (ICCV) - Workshops, 2017

2017
[16]

Soft-nms–improving object detection with one line of code,

N. Bodla, B. Singh, R. Chellappa, and L. S. Davis, “Soft-nms–improving object detection with one line of code,” inInternational Conference on Computer Vision (ICCV), 2017

2017
[17]

A forest fire detection system based on ensemble learning,

R. Xu, H. Lin, K. Lu, L. Cao, and Y . Liu, “A forest fire detection system based on ensemble learning,”Forests, 2021

2021
[18]

Ensemble methods for object detection,

´A. Casado-Garc´ıa and J. Heras, “Ensemble methods for object detection,” inEuropean Conference on Artificial Intelligence (ECAI), 2020

2020
[19]

Ensemble r-fcn for object detection,

J. Li, J. Qian, and Y . Zheng, “Ensemble r-fcn for object detection,” in International Conference on Ubiquitous Information Technologies and Applications, 2017

2017
[20]

Megdet: A large mini-batch object detector,

C. Peng, T. Xiao, Z. Li, Y . Jiang, X. Zhang, K. Jia, G. Yu, and J. Sun, “Megdet: A large mini-batch object detector,” inConference on Computer Vision and Pattern Recognition (CVPR), 2018

2018
[21]

Deep cnn ensemble with data augmentation for object detection,

J. Guo and S. Gould, “Deep cnn ensemble with data augmentation for object detection,”arXiv preprint arXiv:1506.07224, 2015

work page arXiv 2015
[22]

Cascade r-cnn: Delving into high quality object detection,

Z. Cai and N. Vasconcelos, “Cascade r-cnn: Delving into high quality object detection,” inConference on Computer Vision and Pattern Recognition (CVPR), 2018

2018
[23]

An ensemble method of cnn models for object detection,

J. Lee, S.-K. Lee, and S.-I. Yang, “An ensemble method of cnn models for object detection,” inInternational Conference on Information and Communication Technology Convergence (ICTC), 2018

2018
[24]

Ensemble of deep object detectors for page object detection,

N. D. V o, K. Nguyen, T. V . Nguyen, and K. Nguyen, “Ensemble of deep object detectors for page object detection,” inInternational Conference on Ubiquitous Information Management and Communication, 2018

2018
[25]

Wildfire and smoke detection using staged yolo model and ensemble cnn,

C. Bahhar, A. Ksibi, M. Ayadi, M. M. Jamjoom, Z. Ullah, B. O. Soufiene, and H. Sakli, “Wildfire and smoke detection using staged yolo model and ensemble cnn,”Electronics, vol. 12, 2023

2023
[26]

YOLOv5: A state-of-the-art real-time object detection system,

Ultralytics, “YOLOv5: A state-of-the-art real-time object detection system,” 2021

2021
[27]

A yolo-based approach for fire and smoke detection in iot surveillance systems

D. Zhang, “A yolo-based approach for fire and smoke detection in iot surveillance systems.”International Journal of Advanced Computer Science & Applications, vol. 15, no. 1, 2024

2024
[28]

Incremental object detection using ensemble modeling and deep transfer learning,

P. Huayhongthong, S. Rerk-u suk, S. Booddee, P. Padungweang, and K. Warasup, “Incremental object detection using ensemble modeling and deep transfer learning,” inInternational Conference on Computing and Information Technology, 2020

2020
[29]

Learning a mixture of granularity-specific experts for fine-grained categorization,

L. Zhang, S. Huang, W. Liu, and D. Tao, “Learning a mixture of granularity-specific experts for fine-grained categorization,” inInter- national Conference on Computer Vision (ICCV), 2019

2019
[30]

Fg-moe: Heterogeneous mixture of experts model for fine-grained visual classification,

S. Yang, J. Wen, and B. Fang, “Fg-moe: Heterogeneous mixture of experts model for fine-grained visual classification,”Pattern Recognition, p. 113050, 2026

2026
[31]

Mocae: Mixture of calibrated experts significantly improves object detection,

K. Oksuz, S. Kuzucu, T. Joy, and P. K. Dokania, “Mocae: Mixture of calibrated experts significantly improves object detection,”Trans. Mach. Learn. Res., 2024

2024
[32]

Evaluating mixture-of- experts architectures for network aggregation,

S. Pavlitskaya, C. Hubschneider, and M. Weber, “Evaluating mixture-of- experts architectures for network aggregation,” inDeep Neural Networks and Data for Automated Driving: Robustness, Uncertainty Quantification, and Insights Towards Safety, 2022

2022
[33]

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, and J. Dean, “Outrageously large neural networks: The sparsely-gated mixture-of-experts layer,”arXiv preprint arXiv:1701.06538, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[34]

Sparsely- gated mixture-of-expert layers for cnn interpretability,

S. Pavlitska, C. Hubschneider, L. Struppek, and J. M. Z ¨ollner, “Sparsely- gated mixture-of-expert layers for cnn interpretability,” inInternational Joint Conference on Neural Networks (IJCNN), 2023

2023
[35]

Robust experts: the effect of adversarial training on cnns with sparse mixture-of-experts layers,

S. Pavlitska, H. Fan, K. Ditschuneit, and J. M. Z ¨ollner, “Robust experts: the effect of adversarial training on cnns with sparse mixture-of-experts layers,” inInternational Conference on Computer Vision (ICCV) - Workshops, 2025

2025
[36]

Yolov7: Trainable bag- of-freebies sets new state-of-the-art for real-time object detectors,

C. Wang, A. Bochkovskiy, and H. M. Liao, “Yolov7: Trainable bag- of-freebies sets new state-of-the-art for real-time object detectors,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2023

2023
[37]

Yolov9: Learning what you want to learn using programmable gradient information,

C. Wang, I. Yeh, and H. M. Liao, “Yolov9: Learning what you want to learn using programmable gradient information,” inEuropean Conference on Computer Vision (ECCV), 2024

2024
[38]

Microsoft coco: Common objects in context,

T.-Y . Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Doll´ar, and C. L. Zitnick, “Microsoft coco: Common objects in context,” inEuropean Conference on Computer Vision (ECCV). Springer, 2014, pp. 740–755

2014