arxiv: 2606.00844 · v1 · pith:HBALEI5Snew · submitted 2026-05-30 · 💻 cs.CV · cs.AI · cs.LG

MoEIoU: Rethinking Bounding-Box Regression as a Mixture of Experts

Pith reviewed 2026-06-28 18:49 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.LG
keywords bounding-box regression · mixture of experts · IoU loss · object detection · curriculum weighting · localization accuracy · YOLO

The pith

MoEIoU models bounding-box errors as experts and aggregates them with log-sum-exp plus curriculum weighting to improve regression.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that fixed geometric penalties in IoU losses fail to match the shifting error profile during training, where early stages need position and shape fixes while later stages need overlap refinement. MoEIoU instead treats overlap, center alignment, and aspect-ratio mismatch as separate experts whose contributions are combined by log-sum-exp so the largest term dominates while others add smoothly. A curriculum schedule then gradually shifts priority from position and shape correction to overlap improvement. When inserted into YOLO detectors this produces faster convergence and tighter final boxes than standard or recent IoU losses across PASCAL VOC, HRIPCB, and MS COCO. A reader would care because any loss that supplies better optimization signals without changing network architecture directly raises detection accuracy.

Core claim

MoEIoU is a mixture-of-experts regression loss that jointly models overlap, center alignment, and aspect-ratio mismatch. It aggregates the three terms with a log-sum-exp function that emphasizes the dominant localization error at each step while keeping contributions from the others smooth. A curriculum-based weighting schedule prioritizes box position and shape in early training and overlap in later stages. Large-scale experiments on multiple YOLO architectures and the listed datasets show consistent gains over both classic and recent state-of-the-art IoU losses in convergence speed and localization accuracy. The same adaptive aggregation can be applied to existing IoU-based losses to yield consistent gains.

What carries the argument

MoEIoU loss, which aggregates overlap, center-distance, and aspect-ratio terms via log-sum-exp and applies curriculum weighting.
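
To make the mechanism concrete, here is a minimal sketch of a log-sum-exp loss over three curriculum-weighted penalties. It is not the paper's formulation, which this page does not reproduce. The penalty forms are assumptions borrowed from common practice (1 - IoU for overlap, a DIoU-style normalized center distance, a CIoU-style arctan aspect term), although Figure 2 suggests the paper uses a logarithmic IoU penalty instead. The linear schedule, the temperature `tau=0.1`, and the names `iou_terms`, `curriculum_weights`, and `lse_loss` are likewise hypothetical.

```python
import math

def iou_terms(pred, gt, eps=1e-9):
    """Return (overlap, center, aspect) penalties for boxes given as (x1, y1, x2, y2)."""
    px1, py1, px2, py2 = pred
    gx1, gy1, gx2, gy2 = gt
    pw, ph = px2 - px1, py2 - py1
    gw, gh = gx2 - gx1, gy2 - gy1
    # Intersection and union areas.
    iw = max(0.0, min(px2, gx2) - max(px1, gx1))
    ih = max(0.0, min(py2, gy2) - max(py1, gy1))
    inter = iw * ih
    union = pw * ph + gw * gh - inter
    overlap = 1.0 - inter / (union + eps)  # assumed form: 1 - IoU
    # Squared center distance over squared enclosing-box diagonal (DIoU-style, assumed).
    cw = max(px2, gx2) - min(px1, gx1)
    ch = max(py2, gy2) - min(py1, gy1)
    dx = (px1 + px2) / 2 - (gx1 + gx2) / 2
    dy = (py1 + py2) / 2 - (gy1 + gy2) / 2
    center = (dx * dx + dy * dy) / (cw * cw + ch * ch + eps)
    # Aspect-ratio mismatch (CIoU-style arctan term, assumed).
    aspect = (4 / math.pi ** 2) * (math.atan(gw / (gh + eps)) - math.atan(pw / (ph + eps))) ** 2
    return overlap, center, aspect

def curriculum_weights(progress):
    """Hypothetical linear schedule: progress in [0, 1] -> (w_overlap, w_center, w_aspect)."""
    p = min(max(progress, 0.0), 1.0)
    return 0.5 + 0.5 * p, 1.0 - 0.5 * p, 1.0 - 0.5 * p

def lse_loss(pred, gt, progress, tau=0.1):
    """Temperature-scaled log-sum-exp over the curriculum-weighted penalties."""
    scaled = [w * e for w, e in zip(curriculum_weights(progress), iou_terms(pred, gt))]
    m = max(scaled)  # subtract the max before exponentiating, for numerical stability
    return m + tau * math.log(sum(math.exp((s - m) / tau) for s in scaled))
```

With this schedule, `progress=0.0` weights center and aspect at 1.0 and overlap at 0.5, and `progress=1.0` reverses the emphasis. Note that a plain log-sum-exp of three zeros equals `tau * log(3)`, so this sketch does not reach zero for a perfect match.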

If this is right

  • Object detectors converge faster when trained with the adaptive loss.
  • Localization accuracy rises on PASCAL VOC, HRIPCB, and MS COCO.
  • Multiple YOLO architectures obtain measurable gains from the method.
  • The adaptive aggregation produces consistent improvements when added to other existing IoU-based losses.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar dynamic weighting may improve multi-term losses in other vision tasks such as keypoint detection.
  • The mixture structure could be tested on three-dimensional bounding-box regression without architectural changes.
  • It suggests that stage-aware error emphasis may be more important than the precise functional form of any single penalty term.

Load-bearing premise

The log-sum-exp aggregation combined with the curriculum schedule supplies genuinely superior optimization dynamics rather than simply reweighting the same three terms in a way that happens to fit the chosen schedules and datasets.

What would settle it

Train identical YOLO models on the same MS COCO splits with MoEIoU versus standard GIoU or DIoU and observe no difference in final mAP or number of epochs to reach a given mAP threshold.
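
The two quantities in that test, final mAP and epochs to reach a threshold, can be read off per-epoch validation curves. The helper below is a small sketch under the assumption that each run is logged as a list of validation mAP values, one per epoch. The function names are hypothetical and not tied to any YOLO tooling.

```python
def epochs_to_threshold(map_curve, threshold):
    """Return the 1-based epoch at which mAP first reaches the threshold, or None."""
    for epoch, value in enumerate(map_curve, start=1):
        if value >= threshold:
            return epoch
    return None

def compare_runs(curve_a, curve_b, threshold):
    """Summarize the final-mAP gap (a minus b) and each run's epochs to threshold."""
    return {
        "final_map_gap": curve_a[-1] - curve_b[-1],
        "epochs_a": epochs_to_threshold(curve_a, threshold),
        "epochs_b": epochs_to_threshold(curve_b, threshold),
    }
```

A gap near zero on both measures, repeated across seeds, would be the null outcome described above.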

Figures

Figures reproduced from arXiv: 2606.00844 by Priyanka Bagade, Vinay Edula.

Figure 1: Overview of the proposed MoEIoU loss. However, IoU yields zero gradients for non-overlapping boxes, stalling optimization in early training [18]. GIoU addressed this by incorporating the smallest enclosing box as an additional penalty [18], but this term can encourage the predicted box to enlarge rather than move toward the ground truth, and it vanishes once the enclosing box equals the union of the two bo…
Figure 2: IoU and logarithmic IoU penalty. (a) IoU is computed as the ratio between the …
Figure 3: Geometric components used in MoEIoU. (a) The center-distance term measures …
Figure 4: Comparison of hard maximum and Log-Sum-Exp surfaces over …
Figure 5: Gradient redistribution under different temperature values …
Figure 6: Simulation-based comparison of bounding-box regression losses. (a) Mean IoU …
Figure 7: Comparison of bounding box predictions produced by …
Figure 8: Comparison of validation mAP curves and final-epoch mAP gains.
Original abstract

Bounding-box regression is a fundamental component of object detection, playing a critical role in precise object localization. Existing Intersection-over-Union (IoU)-based loss functions extend the IoU objective by incorporating geometric penalties, such as center-distance and aspect-ratio mismatch, to improve bounding-box regression. However, these penalties typically remain fixed throughout training and do not account for the optimization dynamics in which predicted boxes initially exhibit large center-distance and shape errors, with later stages focusing on improving overlap with the ground truth. To address this limitation, we introduce MoEIoU, a mixture-of-experts based regression loss that jointly models overlap, center alignment, and aspect-ratio mismatch. MoEIoU aggregates these components using a log-sum-exp function, which emphasizes the dominant localization error while maintaining smooth contributions from other terms. Additionally, a curriculum-based weighting schedule is employed to prioritize correcting box position and shape in early training stages and improving overlap in later stages. We evaluated proposed MoEIoU on PASCAL VOC, HRIPCB, and MS COCO using multiple YOLO architectures, along with large-scale simulation experiments. It consistently outperforms standard and recent state-of-the-art losses, demonstrating faster convergence and improved localization accuracy. We further show that this adaptive aggregation improves existing IoU-based losses, yielding consistent gains and providing more effective optimization guidance for bounding-box regression in object detection frameworks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces MoEIoU, a mixture-of-experts loss for bounding-box regression in object detection. It models overlap, center-distance, and aspect-ratio mismatch as components aggregated via log-sum-exp to emphasize the dominant error, combined with a curriculum schedule that prioritizes position and shape early and overlap later. Evaluations on PASCAL VOC, HRIPCB, and MS COCO with multiple YOLO architectures, plus large-scale simulations, show consistent outperformance over standard and recent IoU losses with faster convergence and better localization.

Significance. If the improvements are due to the adaptive aggregation rather than the curriculum alone, this could provide a useful framework for designing regression losses that account for training dynamics. The inclusion of large-scale simulation experiments is a strength for validating the approach beyond specific datasets. However, the significance is tempered by the need to confirm the source of the gains.

major comments (3)
  1. [Experimental evaluation] The manuscript does not include an ablation study that compares MoEIoU to a version using a simple convex combination or time-varying weights of the same three geometric terms while retaining the curriculum schedule. This is necessary to rule out that the reported gains stem from the curriculum reweighting rather than the log-sum-exp MoE aggregation, as the latter is a smooth approximation to max that may not introduce fundamentally new dynamics.
  2. [Loss formulation] The paper claims the log-sum-exp emphasizes the dominant localization error, but without a derivation or analysis showing how this differs from standard weighted sums in terms of gradient flow or convergence properties, the advantage over existing penalties remains unclear.
  3. [Results tables] The reported performance improvements lack accompanying standard deviations, number of runs, or statistical tests, making it difficult to assess the robustness of the outperformance claims across the YOLO architectures and datasets.
minor comments (2)
  1. [Abstract] The abstract mentions 'large-scale simulation experiments' but does not specify what these simulations entail or their purpose in the main text.
  2. Ensure that the curriculum transition schedule parameters are clearly documented to allow reproduction.
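
Major comment 2 asks how log-sum-exp differs from a weighted sum in gradient terms. The standard calculation is short and is sketched below. The symbols are this note's own notation rather than the paper's: e_i are the three penalties, \tau > 0 is a temperature, w_i are fixed or scheduled weights, and \theta stands for the box parameters.

```latex
% Notation (e_i, \tau, w_i, \pi_i, \theta) is assumed for illustration.
L_{\mathrm{LSE}} = \tau \log \sum_{i=1}^{3} \exp\!\left(\frac{e_i}{\tau}\right),
\qquad
\frac{\partial L_{\mathrm{LSE}}}{\partial \theta}
  = \sum_{i=1}^{3} \pi_i \,\frac{\partial e_i}{\partial \theta},
\qquad
\pi_i = \frac{\exp(e_i/\tau)}{\sum_{j=1}^{3} \exp(e_j/\tau)}

L_{\mathrm{sum}} = \sum_{i=1}^{3} w_i\, e_i,
\qquad
\frac{\partial L_{\mathrm{sum}}}{\partial \theta}
  = \sum_{i=1}^{3} w_i \,\frac{\partial e_i}{\partial \theta}
```

Both gradients are weighted sums of the same per-term gradients. The difference is that the softmax weights \pi_i depend on the current errors, sum to one, and concentrate on the largest e_i as \tau shrinks, whereas the w_i depend at most on training time. The requested ablation therefore tests whether state-dependent weighting adds anything beyond schedule-dependent weighting.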

Simulated Author's Rebuttal

3 responses · 0 unresolved

Thank you for the constructive feedback. We address each major comment below and will incorporate revisions to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Experimental evaluation] The manuscript does not include an ablation study that compares MoEIoU to a version using a simple convex combination or time-varying weights of the same three geometric terms while retaining the curriculum schedule. This is necessary to rule out that the reported gains stem from the curriculum reweighting rather than the log-sum-exp MoE aggregation, as the latter is a smooth approximation to max that may not introduce fundamentally new dynamics.

    Authors: We agree that an ablation isolating the log-sum-exp aggregation from the curriculum is needed to substantiate the contribution of the MoE component. In the revised manuscript we will add this ablation, comparing MoEIoU against a convex-combination baseline that retains the identical curriculum schedule on the same three geometric terms. revision: yes

  2. Referee: [Loss formulation] The paper claims the log-sum-exp emphasizes the dominant localization error, but without a derivation or analysis showing how this differs from standard weighted sums in terms of gradient flow or convergence properties, the advantage over existing penalties remains unclear.

    Authors: We will add a dedicated analysis section deriving the gradient-flow behavior of the log-sum-exp aggregator relative to fixed weighted sums, showing how the soft-max-like emphasis on the dominant term alters the optimization trajectory and convergence properties. revision: yes

  3. Referee: [Results tables] The reported performance improvements lack accompanying standard deviations, number of runs, or statistical tests, making it difficult to assess the robustness of the outperformance claims across the YOLO architectures and datasets.

    Authors: We will rerun the key experiments with multiple random seeds, report standard deviations in all tables, and include paired statistical significance tests to quantify the robustness of the observed gains. revision: yes
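
The multi-seed comparison promised in the third response could be summarized with a paired t-statistic over per-seed scores. The sketch below uses only the Python standard library and stops at the statistic and its degrees of freedom, since a p-value would need a t-distribution routine from a separate library. The function name is hypothetical, and the numbers are placeholders for illustration, not results from the paper.

```python
import math
import statistics

def paired_t_statistic(scores_a, scores_b):
    """Paired t-statistic for per-seed scores of two methods (same seeds, same order)."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equally long lists with at least two seeds")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1 denominator)
    if sd_diff == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    t = mean_diff / (sd_diff / math.sqrt(len(diffs)))
    return mean_diff, sd_diff, t, len(diffs) - 1  # last value: degrees of freedom

# Placeholder per-seed mAP values; NOT taken from the paper.
new_loss = [0.52, 0.54, 0.53, 0.55, 0.51]
baseline = [0.50, 0.53, 0.51, 0.52, 0.50]
mean_diff, sd_diff, t, dof = paired_t_statistic(new_loss, baseline)
print(f"mean diff={mean_diff:.4f}, sd={sd_diff:.4f}, t={t:.2f}, dof={dof}")
```

Pairing by seed matters here because both losses would share initialization and data order, so per-seed differences remove variance that an unpaired comparison would keep.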

Circularity Check

0 steps flagged

No circularity; loss is explicitly constructed from standard terms and gains are empirical

full rationale

The paper defines MoEIoU directly as log-sum-exp aggregation of the three conventional geometric penalties (IoU overlap, center distance, aspect ratio) plus an explicit curriculum schedule on their weights. The claimed superiority is presented solely as an empirical outcome measured on held-out datasets (PASCAL VOC, HRIPCB, MS COCO) and YOLO backbones; no equation or theorem inside the paper equates the reported performance numbers to any quantity that was fitted or assumed inside the same derivation. No self-citations are invoked to justify uniqueness or to close the argument. The construction is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

Only the abstract is available, so the precise number and values of any mixture weights or curriculum transition points cannot be audited; the method relies on the standard properties of the log-sum-exp function and on the empirical claim that the three geometric terms are usefully separable.

free parameters (1)
  • curriculum transition schedule
    The weighting between position/shape and overlap terms changes over training; the functional form and any fitted constants that control the schedule are free parameters.
axioms (1)
  • [standard math] log-sum-exp provides a smooth, differentiable approximation to the maximum of several terms
    Invoked to aggregate the expert outputs while preserving gradient flow.
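
The axiom can be checked numerically. For n inputs and temperature tau, the temperature-scaled log-sum-exp lies between the maximum and the maximum plus tau * log(n), so it tightens toward the hard maximum as tau shrinks. The snippet below is a small sketch of that check. The input values are arbitrary, and the temperature parameter is an assumption, since this page does not give the paper's settings.

```python
import math

def log_sum_exp(values, tau=1.0):
    """Numerically stable tau * log(sum(exp(v / tau)))."""
    m = max(values)
    return m + tau * math.log(sum(math.exp((v - m) / tau) for v in values))

errors = [0.9, 0.3, 0.1]  # arbitrary illustrative penalties
for tau in (1.0, 0.1, 0.01):
    value = log_sum_exp(errors, tau)
    upper = max(errors) + tau * math.log(len(errors))
    # Sandwich bound: max <= LSE <= max + tau * log(n)
    assert max(errors) <= value <= upper + 1e-12
    print(f"tau={tau}: lse={value:.4f}, gap to max={value - max(errors):.4f}")
```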

Reference graph

Works this paper leans on

26 extracted references · 18 canonical work pages · 2 internal anchors

  1. [1] Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML ’09, page 41–48, New York, NY, USA, 2009. Association for Computing Machinery. ISBN 9781605585161. doi: 10.1145/1553374.1553380

  2. [2] Weixun Chen, Xuneng Ke, and Siming Meng. Small defect detection in printed circuit boards based on the multiscale edge strengthening and an improved yolov10. Scientific Reports, 15, 10 2025. doi: 10.1038/s41598-025-20387-x

  3. [3] Mark Everingham, Luc Van Gool, Christopher Williams, John Winn, and Andrew Zisserman. The pascal visual object classes (voc) challenge. International Journal of Computer Vision, 88:303–338, 06 2010. doi: 10.1007/s11263-009-0275-4

  4. [4] Zhora Gevorgyan. Siou loss: More powerful learning for bounding box regression, 2022

  5. [5] Ross Girshick. Fast r-cnn. In 2015 IEEE International Conference on Computer Vision (ICCV), pages 1440–1448, 2015. doi: 10.1109/ICCV.2015.169

  6. [6] Priya Goyal, Piotr Dollar, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch sgd: Training imagenet in 1 hour. arXiv preprint arXiv:1706.02677, 2017

  7. [7] Weibo Huang, Peng Wei, Manhua Zhang, and Hong Liu. Hripcb: a challenging dataset for pcb defects detection and classification. The Journal of Engineering, 2020(13):303–309, 2020. doi: 10.1049/joe.2019.1183

  8. [8] Peter J. Huber. Robust Estimation of a Location Parameter. The Annals of Mathematical Statistics, 35(1):73–101, 1964. doi: 10.1214/aoms/1177703732. URL https://doi.org/10.1214/aoms/1177703732

  9. [9] Robert A. Jacobs, Michael I. Jordan, Steven J. Nowlan, and Geoffrey E. Hinton. Adaptive mixtures of local experts. Neural Computation, 3:79–87, 1991. doi: 10.1162/neco.1991.3.1.79

  10. [10] Glenn Jocher and Jing Qiu. Ultralytics yolo26, 2026. URL https://github.com/ultralytics/ultralytics

  11. [11] Glenn Jocher, Jing Qiu, and Ayush Chaurasia. Ultralytics YOLO, January 2023. URL https://github.com/ultralytics/ultralytics

  12. [12] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft coco: Common objects in context. In David Fleet, Tomas Pajdla, Bernt Schiele, and Tinne Tuytelaars, editors, Computer Vision – ECCV 2014, pages 740–755, Cham, 2014. Springer International Publishing. ISBN 978-3-319-1060...

  13. [13] Can Liu, Kaige Wang, Qing Li, Fazhan Zhao, Kun Zhao, and Hongtu Ma. Powerful-iou: More straightforward and faster bounding box regression loss with a nonmonotonic focusing mechanism. Neural Networks, 170:276–284, 2024. ISSN 0893-6080. doi: 10.1016/j.neunet.2023.11.041

  14. [14] Ilya Loshchilov and Frank Hutter. Sgdr: Stochastic gradient descent with warm restarts. In International Conference on Learning Representations, 2017

  15. [15] Xiangjie Luo, Zhihao Cai, Bo Shao, and Yingxun Wang. Unified-iou: For high-quality object detection, 2024

  16. [16] Ken Perlin. Improving noise. ACM Trans. Graph., 21(3):681–682, July 2002. ISSN 0730-0301. doi: 10.1145/566654.566636. URL https://doi.org/10.1145/566654.566636

  17. [17] Joseph Redmon, Santosh Kumar Divvala, Ross B. Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 779–788, 2015. doi: 10.48550/arXiv.1506.02640. URL https://api.semanticscholar.org/CorpusID:206594738

  18. [18] Hamid Rezatofighi, Nathan Tsoi, JunYoung Gwak, Amir Sadeghian, Ian Reid, and Silvio Savarese. Generalized Intersection Over Union: A Metric and a Loss for Bounding Box Regression. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 658–666, Los Alamitos, CA, USA, June 2019. IEEE Computer Society. doi: 10.1109/CVPR.2019.00075

  19. [19] Nermin Samet, Samet Hicsonmez, and Emre Akbas. Houghnet: Integrating near and long-range evidence for bottom-up object detection. In Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV, page 406–423, Berlin, Heidelberg, 2020. Springer-Verlag. ISBN 978-3-030-58594-5. doi: 10.1007/978-3-030-58595-2_25

  20. [20] Yunjie Tian, Qixiang Ye, and David Doermann. Yolov12: Attention-centric real-time object detectors, 2025

  21. [21] Zanjia Tong, Yuhang Chen, Zewei Xu, and Rong Yu. Wise-iou: Bounding box regression loss with dynamic focusing mechanism, 2023

  22. [22] Jiahui Yu, Yuning Jiang, Zhangyang Wang, Zhimin Cao, and Thomas Huang. Unitbox: An advanced object detection network. In Proceedings of the 24th ACM International Conference on Multimedia, MM ’16, page 516–520, New York, NY, USA, 2016. Association for Computing Machinery. ISBN 9781450336031. doi: 10.1145/2964284.2967274

  23. [23] Yi-Fan Zhang, Weiqiang Ren, Zhang Zhang, Zhen Jia, Liang Wang, and Tieniu Tan. Focal and efficient iou loss for accurate bounding box regression. Neurocomputing, 506:146–157, 2022. ISSN 0925-2312. doi: 10.1016/j.neucom.2022.07.042

  24. [24] Zhong-Qiu Zhao, Peng Zheng, Shou-Tao Xu, and Xindong Wu. Object detection with deep learning: A review. IEEE Transactions on Neural Networks and Learning Systems, 30(11):3212–3232, 2019. doi: 10.1109/TNNLS.2018.2876865

  25. [25] Zhaohui Zheng, Ping Wang, Wei Liu, Jinze Li, Rongguang Ye, and Dongwei Ren. Distance-iou loss: Faster and better learning for bounding box regression. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), volume 34, pages 12993–13000, 2020. doi: 10.1609/aaai.v34i07.6999

  26. [26] Zhaohui Zheng, Ping Wang, Dongwei Ren, Wei Liu, Rongguang Ye, Qinghua Hu, and Wangmeng Zuo. Enhancing geometric factors in model learning and inference for object detection and instance segmentation. IEEE Transactions on Cybernetics, 52:8574–8586, 2020. doi: 10.48550/arXiv.2005.03572