QYOLO: Lightweight Object Detection via Quantum Inspired Shared Channel Mixing
Pith reviewed 2026-05-07 11:10 UTC · model grok-4.3
The pith
A quantum-inspired shared sinusoidal mixer can replace the two deepest C2f modules in the YOLOv8 backbone, cutting parameters by about 20 percent at a cost of under 0.5 percentage points of mAP@50.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that the QMixBlock, which performs global channel recalibration through sinusoidal mixing with parameters shared across the 512-channel and 1024-channel stages, can stand in for the two heaviest C2f modules in the YOLOv8 backbone. On VisDrone2019 the nano model sees a 20.2 percent parameter reduction (3.01M to 2.40M) and a 12.3 percent GFLOPs reduction with only 0.4 pp mAP@50 degradation; the small model sees a 21.8 percent parameter reduction with 0.1 pp degradation. The shared parameters enforce consistent channel importance across both stages without separate per-stage sets, and the unchanged neck and head preserve the original detection pipeline.
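The headline reduction can be sanity-checked from the rounded counts quoted above. The short Python sketch below does that arithmetic; the helper name `relative_reduction` is illustrative, and because 3.01M and 2.40M are themselves rounded, the result may differ from the reported 20.2 percent in the last digit.

```python
def relative_reduction(before: float, after: float) -> float:
    """Return the fractional drop from `before` to `after`."""
    return (before - after) / before

# Rounded parameter counts (in millions) quoted for the nano model.
baseline_params_m = 3.01
qyolo_params_m = 2.40

drop = relative_reduction(baseline_params_m, qyolo_params_m)
print(f"parameter reduction: {drop:.1%}")
```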
What carries the argument
The QMixBlock, a compact replacement module that performs global channel recalibration via sinusoidal mixing with a single shared set of learnable parameters across the two deepest backbone stages.
If this is right
- The neck and detection head remain fully classical, so the compression does not propagate changes through the rest of the architecture.
- Knowledge distillation restores full accuracy parity with no additional parameter cost.
- Limiting changes to the backbone alone yields a better accuracy-compression balance than extending the modifications to the neck, which reached 38 to 41 percent reduction but with greater accuracy degradation.
- The approach directly attacks quadratic scaling in high-stride C2f modules while preserving real-time detection speed.
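The quadratic-scaling point in the last bullet can be made concrete with the standard weight count for a dense convolution. The sketch below is a generic illustration, not the paper's C2f accounting: `conv_params` is a hypothetical helper, the 3x3 kernel is an assumption, and 512 and 1024 are the stage widths named in the abstract (the scaled nano and small variants may use narrower layers than these nominal widths).

```python
def conv_params(c_in: int, c_out: int, kernel: int = 3, bias: bool = False) -> int:
    """Weight count of a dense 2D convolution: kernel^2 * c_in * c_out (+ c_out biases)."""
    return kernel * kernel * c_in * c_out + (c_out if bias else 0)

for width in (512, 1024):
    print(width, conv_params(width, width))

# Doubling the channel width quadruples the weight count.
ratio = conv_params(1024, 1024) / conv_params(512, 512)
print(ratio)
```

By contrast, a parameter set whose size does not depend on the channel count C avoids this growth, which is the property the shared sinusoidal design relies on.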
Where Pith is reading between the lines
- The shared sinusoidal mechanism could be tested in other single-stage detectors that rely on similar bottleneck blocks to see whether the compression pattern generalizes beyond YOLOv8.
- If the parameters prove robust across datasets, future designs might reduce the need for per-stage hyperparameter searches in multi-resolution backbones.
- Evaluating the same replacement on ground-level datasets such as COCO would reveal whether the minimal degradation holds when object scales and channel statistics differ from aerial imagery.
Load-bearing premise
A single set of shared learnable sinusoidal parameters can perform effective global channel recalibration for both the 512-channel and 1024-channel stages without needing stage-specific parameters or losing critical representational capacity.
What would settle it
If independent per-stage sinusoidal parameters produce substantially higher mAP at roughly the same total parameter count, the shared-parameter design would be shown to sacrifice accuracy for simplicity.
Original abstract
The rapid advancement of object detection architectures has positioned single stage detectors as the dominant solution for real-time visual perception. A primary source of computational overhead in these models lies in the deep backbone stages, where C2f bottleneck modules at high stride levels accumulate a disproportionate share of parameters due to quadratic scaling with channel width. This work introduces QYOLO, a quantum-inspired channel mixing framework that achieves genuine architectural compression by replacing the two deepest backbone C2f modules at P4/16 (512 channels) and P5/32 (1024 channels) with a compact QMixBlock. The proposed block performs global channel recalibration through a sinusoidal mixing mechanism with shared learnable parameters across both backbone stages, enforcing consistent channel importance without requiring independent per-stage parameter sets. The neck and detection head remain fully classical and unchanged. Evaluation on the VisDrone2019 benchmark demonstrates that QYOLOv8n achieves a 20.2% reduction in parameter count (3.01M to 2.40M) and 12.3% GFLOPs reduction with only 0.4 pp mAP@50 degradation. QYOLOv8s achieves 21.8% reduction with 0.1 pp degradation. When combined with knowledge distillation, full accuracy parity is recovered at no cost to compression. An expanded backbone plus neck variant achieved 38 to 41% reduction at the cost of greater accuracy degradation, motivating the backbone-only final design.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces QYOLO, a modification to YOLOv8 that replaces the two deepest backbone C2f modules (at P4/16 with 512 channels and P5/32 with 1024 channels) with a single compact QMixBlock. This block performs global channel recalibration via a sinusoidal mixing mechanism that uses one shared set of learnable parameters across both stages. The neck and head remain unchanged. On the VisDrone2019 benchmark, QYOLOv8n reports a 20.2% parameter reduction (3.01M to 2.40M) and 12.3% GFLOPs reduction with 0.4 pp mAP@50 drop; QYOLOv8s shows 21.8% reduction with 0.1 pp drop. Knowledge distillation recovers full accuracy parity while preserving the compression.
Significance. If the shared sinusoidal parameters prove effective for recalibrating channels at both widths, the work would demonstrate a practical route to architectural compression in single-stage detectors by avoiding per-stage parameter duplication. The concrete deltas on a public benchmark, the option to restore accuracy via distillation, and the focus on backbone-only changes are strengths that could inform efficient real-time detection designs. However, the lack of ablations isolating the shared-parameter mechanism limits attribution of gains specifically to the quantum-inspired component.
major comments (2)
- [QMixBlock description] Methods section describing the QMixBlock: the central claim that a single shared set of learnable sinusoidal parameters suffices for effective global recalibration at both 512-channel and 1024-channel stages lacks any supporting equation, pseudocode, or analysis showing how the same frequency/phase/amplitude values adapt when input dimensionality doubles. This is load-bearing for the efficiency claim, as the 20% parameter cut and sub-0.5 pp mAP degradation are attributed to the shared design rather than simple module removal.
- [Evaluation on VisDrone2019] Experimental results and ablation studies: no experiments compare the shared-parameter QMixBlock against variants with stage-specific parameters or against a non-sinusoidal lightweight replacement for the C2f modules. Without such controls, the reported 0.4 pp and 0.1 pp mAP@50 degradations cannot be confidently linked to the quantum-inspired shared mixing rather than the particular VisDrone split or training details.
minor comments (2)
- [Introduction] The abstract and introduction refer to 'quantum-inspired' sinusoidal mixing but provide no explicit connection or reference to quantum computing concepts that motivate the sinusoidal form.
- [Experiments] Training protocol, optimizer settings, and data augmentation details are not specified, which hinders exact reproduction of the reported mAP and efficiency numbers.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment below and commit to revisions that improve the clarity of the QMixBlock formulation and strengthen the experimental analysis.
- Referee: [QMixBlock description] Methods section describing the QMixBlock: the central claim that a single shared set of learnable sinusoidal parameters suffices for effective global recalibration at both 512-channel and 1024-channel stages lacks any supporting equation, pseudocode, or analysis showing how the same frequency/phase/amplitude values adapt when input dimensionality doubles. This is load-bearing for the efficiency claim, as the 20% parameter cut and sub-0.5 pp mAP degradation are attributed to the shared design rather than simple module removal.
Authors: We agree that the original methods section presented the QMixBlock at an insufficient level of detail. In the revised manuscript we will add the explicit mathematical formulation, pseudocode, and adaptation analysis. The shared parameters consist of a compact set of learnable frequencies ω, phases φ, and amplitudes a (K=16). For an input with C channels the channel descriptor d is obtained by global average pooling; the recalibration weights are then r_c = sigmoid(∑_k a_k ⋅ sin(ω_k ⋅ (c/C) + φ_k)) for channel index c. Normalizing the argument by C makes the same parameter set operate on both 512- and 1024-channel stages without modification. We will include the corresponding pseudocode and a short paragraph explaining why this dimension-agnostic sinusoidal mixing avoids per-stage duplication while preserving recalibration effectiveness. These additions will appear in Section 3.2. revision: yes
- Referee: [Evaluation on VisDrone2019] Experimental results and ablation studies: no experiments compare the shared-parameter QMixBlock against variants with stage-specific parameters or against a non-sinusoidal lightweight replacement for the C2f modules. Without such controls, the reported 0.4 pp and 0.1 pp mAP@50 degradations cannot be confidently linked to the quantum-inspired shared mixing rather than the particular VisDrone split or training details.
Authors: The referee is correct that the submitted version lacks these targeted controls. We will add an ablation subsection that directly compares three variants under identical training protocols on VisDrone2019: (i) the proposed shared sinusoidal QMixBlock, (ii) an otherwise identical block with independent parameters per stage, and (iii) a non-sinusoidal lightweight replacement using a single linear projection. The new results will quantify parameter counts, GFLOPs, and mAP@50 for each, allowing readers to attribute the observed compression and accuracy retention specifically to the shared quantum-inspired mechanism rather than training details or dataset split. These experiments will be reported in the main text or supplementary material as space permits. revision: yes
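The recalibration formula quoted in the first response can be sketched in a few lines. The pure-Python example below evaluates r_c = sigmoid(∑_k a_k ⋅ sin(ω_k ⋅ (c/C) + φ_k)) with one shared parameter set for both stage widths. It is a sketch under assumptions, not the authors' implementation: the function name `recalibration_weights`, the random initial values, and the zero-based channel index are all illustrative. As quoted, the formula depends only on the normalized channel index, so the pooled descriptor d does not appear in it.

```python
import math
import random

K = 16  # number of shared sinusoidal components, as stated in the rebuttal

def recalibration_weights(num_channels, omega, phi, amp):
    """Per-channel gates r_c = sigmoid(sum_k amp_k * sin(omega_k * c/C + phi_k))."""
    weights = []
    for c in range(num_channels):  # assumption: channels indexed from 0
        pos = c / num_channels     # normalizing by C makes the input width-agnostic
        s = sum(a * math.sin(w * pos + p) for a, w, p in zip(amp, omega, phi))
        weights.append(1.0 / (1.0 + math.exp(-s)))  # sigmoid
    return weights

# Hypothetical initialization; in training these values would be learned.
rng = random.Random(0)
omega = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(K)]
phi = [rng.uniform(-math.pi, math.pi) for _ in range(K)]
amp = [rng.uniform(-1.0, 1.0) for _ in range(K)]

# The same 3 * K = 48 scalars serve both stages.
r512 = recalibration_weights(512, omega, phi, amp)
r1024 = recalibration_weights(1024, omega, phi, amp)

# Channel c at width 512 and channel 2c at width 1024 share the same c/C,
# so they receive the same gate.
print(len(r512), len(r1024), 3 * K)
print(abs(r512[100] - r1024[200]) < 1e-12)
```

Under this reading, the 1024-channel stage simply samples the same gate curve at twice the density. Whether that leaves enough capacity is what the proposed comparison against independent per-stage parameters would test.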
Circularity Check
No circularity; empirical claims rest on external benchmark evaluation
The paper introduces QMixBlock as a replacement for the deepest C2f modules, using a sinusoidal mixing mechanism with one shared set of learnable parameters across the 512- and 1024-channel stages. All reported gains (20.2% parameter reduction, 12.3% GFLOPs reduction, <0.5 pp mAP@50 change) are obtained by direct measurement against unmodified YOLOv8 baselines on the external VisDrone2019 dataset; no first-principles derivation, fitted-parameter prediction, or self-citation chain is invoked to establish the result. The design choice of shared parameters is presented as an engineering decision whose validity is tested empirically rather than assumed by construction or prior self-work.
Axiom & Free-Parameter Ledger
free parameters (1)
- shared learnable parameters for sinusoidal mixing
axioms (1)
- domain assumption: sinusoidal functions with shared parameters can replace independent C2f bottleneck processing for global channel importance
invented entities (1)
- QMixBlock: no independent evidence