QYOLO: Lightweight Object Detection via Quantum Inspired Shared Channel Mixing

Garvit Kumar Mittal; Sahil Tomar; Sandeep Kumar

arXiv: 2604.26435 · v1 · submitted 2026-04-29 · cs.CV · cs.AI · cs.ET

Pith reviewed 2026-05-07 11:10 UTC · model grok-4.3

classification: cs.CV · cs.AI · cs.ET

keywords: lightweight object detection, YOLO compression, quantum-inspired, channel mixing, shared parameters, backbone optimization, real-time perception, VisDrone benchmark

The pith

A quantum-inspired shared sinusoidal mixer can replace the two deepest C2f modules in YOLOv8 backbones, cutting parameters by about 20 percent while losing less than 0.5 percentage points of mAP@50 on VisDrone2019.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to demonstrate that genuine compression in single-stage object detectors is possible by targeting the quadratic parameter growth in the deepest backbone stages. It replaces the P4/16 and P5/32 C2f blocks with a compact QMixBlock that performs global channel recalibration using sinusoidal mixing and one shared set of learnable parameters. The neck and head stay classical, so the savings come purely from the backbone redesign. A reader would care because this keeps real-time inference viable on constrained hardware while the model still works on challenging aerial datasets. When distillation is added, any small accuracy gap disappears without undoing the compression.
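The quadratic growth mentioned above can be illustrated with a single square convolution. The sketch below is a simplified assumption, not the C2f module itself (whose internal layout is not described here): it counts the weights of one bias-free k×k convolution with equal input and output width, using the nominal 512 and 1024 channel widths from the paper. The helper name `conv_weight_count` is hypothetical.

```python
def conv_weight_count(channels_in, channels_out, kernel_size=3):
    """Weights in one bias-free 2D convolution: k * k * C_in * C_out."""
    return kernel_size * kernel_size * channels_in * channels_out


# Nominal widths of the two deepest backbone stages named in the paper.
p512 = conv_weight_count(512, 512)
p1024 = conv_weight_count(1024, 1024)

# Doubling the channel width multiplies the weight count by four.
ratio = p1024 / p512

print(f"512 -> 512 channels:   {p512:,} weights")
print(f"1024 -> 1024 channels: {p1024:,} weights")
print(f"growth factor:         {ratio}")
```

Doubling the width from 512 to 1024 quadruples the weight count of such a layer, which is why the deepest stages are the natural place to look for savings.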

Core claim

The central claim is that the QMixBlock, which applies sinusoidal mixing for global channel recalibration with parameters shared across the 512-channel and 1024-channel stages, can stand in for the two heaviest C2f modules in the YOLOv8 backbone. On VisDrone2019 this produces a 20.2 percent parameter drop (3.01M to 2.40M) and 12.3 percent GFLOPs drop for the nano model with only 0.4 pp mAP@50 degradation; the small model sees 21.8 percent parameter reduction with 0.1 pp degradation. The shared parameters enforce uniform channel importance without separate per-stage sets, and the unchanged neck and head preserve the original detection pipeline.
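As a quick consistency check on the nano-model figures, the reduction can be recomputed from the rounded parameter counts quoted above. This is only a sketch: the paper's 20.2 percent was presumably computed from unrounded counts, which are not given here, so the code also reports the range of reductions compatible with counts rounded to two decimals. The function name `percent_reduction` is illustrative.

```python
def percent_reduction(before, after):
    """Relative reduction from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before


# Rounded parameter counts (in millions) as quoted for QYOLOv8n.
BASELINE_M = 3.01
COMPRESSED_M = 2.40

from_rounded = percent_reduction(BASELINE_M, COMPRESSED_M)

# Each count is rounded to two decimals, so the true value lies within +/-0.005.
HALF_STEP = 0.005
lowest = percent_reduction(BASELINE_M - HALF_STEP, COMPRESSED_M + HALF_STEP)
highest = percent_reduction(BASELINE_M + HALF_STEP, COMPRESSED_M - HALF_STEP)

print(f"from rounded counts: {from_rounded:.2f}%")
print(f"compatible range:    {lowest:.2f}% to {highest:.2f}%")
```

The rounded counts give roughly 20.3 percent, and the reported 20.2 percent falls inside the compatible range, so the headline numbers are mutually consistent.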

What carries the argument

The QMixBlock, a compact replacement module that performs global channel recalibration via sinusoidal mixing with a single shared set of learnable parameters across the two deepest backbone stages.

If this is right

The neck and detection head remain fully classical, so the compression does not propagate changes through the rest of the architecture.
Knowledge distillation restores full accuracy parity with no additional parameter cost.
Limiting changes to the backbone alone yields better accuracy-compression balance than expanding modifications to the neck.
The approach directly attacks quadratic scaling in high-stride C2f modules while preserving real-time detection speed.
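The distillation step is not specified in this summary. As a point of reference only, the sketch below implements the classic soft-target loss of Hinton, Vinyals, and Dean, which the paper cites; whether QYOLO uses this loss, a feature-level variant, or something else is an assumption left open. The function names, the example logits, and the temperature value are illustrative.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=4.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(
        pt * math.log(pt / ps)
        for pt, ps in zip(p_teacher, p_student)
        if pt > 0.0
    )
    return temperature ** 2 * kl


# Hypothetical class logits for one prediction.
teacher = [2.0, 0.5, -1.0]
student = [1.5, 0.7, -0.8]

loss_same = distillation_loss(teacher, teacher)  # identical outputs
loss_diff = distillation_loss(student, teacher)  # mismatched outputs

print(loss_same, loss_diff)
```

A loss of this kind only changes the training signal; the student keeps the same architecture, which is consistent with the claim that distillation adds no parameters.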

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The shared sinusoidal mechanism could be tested in other single-stage detectors that rely on similar bottleneck blocks to see whether the compression pattern generalizes beyond YOLOv8.
If the parameters prove robust across datasets, future designs might reduce the need for per-stage hyperparameter searches in multi-resolution backbones.
Evaluating the same replacement on ground-level datasets such as COCO would reveal whether the minimal degradation holds when object scales and channel statistics differ from aerial imagery.

Load-bearing premise

A single set of shared learnable sinusoidal parameters can perform effective global channel recalibration for both the 512-channel and 1024-channel stages without needing stage-specific parameters or losing critical representational capacity.

What would settle it

If independent per-stage sinusoidal parameters produce substantially higher mAP at roughly the same total parameter count, the shared-parameter design would be shown to sacrifice accuracy for simplicity.

Figures

Figures reproduced from arXiv: 2604.26435 by Garvit Kumar Mittal, Sahil Tomar, Sandeep Kumar.

**Figure 1.** Overview of the proposed QYOLO architecture. The QMixBlock replaces C2f modules at backbone layers P4/16 and P5/32.

**Figure 2.** Conceptual comparison of parameter sharing strategies in feature pyramid networks.

Original abstract

The rapid advancement of object detection architectures has positioned single stage detectors as the dominant solution for real-time visual perception. A primary source of computational overhead in these models lies in the deep backbone stages, where C2f bottleneck modules at high stride levels accumulate a disproportionate share of parameters due to quadratic scaling with channel width. This work introduces QYOLO, a quantum-inspired channel mixing framework that achieves genuine architectural compression by replacing the two deepest backbone C2f modules at P4/16 (512 channels) and P5/32 (1024 channels) with a compact QMixBlock. The proposed block performs global channel recalibration through a sinusoidal mixing mechanism with shared learnable parameters across both backbone stages, enforcing consistent channel importance without requiring independent per-stage parameter sets. The neck and detection head remain fully classical and unchanged. Evaluation on the VisDrone2019 benchmark demonstrates that QYOLOv8n achieves a 20.2% reduction in parameter count (3.01M to 2.40M) and 12.3% GFLOPs reduction with only 0.4 pp mAP@50 degradation. QYOLOv8s achieves 21.8% reduction with 0.1 pp degradation. When combined with knowledge distillation, full accuracy parity is recovered at no cost to compression. An expanded backbone plus neck variant achieved 38 to 41% reduction at the cost of greater accuracy degradation, motivating the backbone-only final design.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

QYOLO gets modest 20% param savings on YOLOv8 by swapping deep C2f blocks for a shared sinusoidal mixer, but the mechanism lacks ablations and the shared-parameter claim looks under-tested.

This paper replaces the two deepest C2f modules in YOLOv8 backbones with a QMixBlock that applies sinusoidal channel mixing using one shared set of learnable parameters for both the 512-channel and 1024-channel stages. On VisDrone2019 it reports 20.2% fewer parameters and 12.3% fewer GFLOPs for the nano model with a 0.4 point mAP@50 drop, and even smaller accuracy loss for the small variant; distillation then restores parity while keeping the compression. The neck and head stay untouched, which keeps the change localized.

Referee Report

2 major / 2 minor

Summary. The paper introduces QYOLO, a modification to YOLOv8 that replaces the two deepest backbone C2f modules (at P4/16 with 512 channels and P5/32 with 1024 channels) with a single compact QMixBlock. This block performs global channel recalibration via a sinusoidal mixing mechanism that uses one shared set of learnable parameters across both stages. The neck and head remain unchanged. On the VisDrone2019 benchmark, QYOLOv8n reports a 20.2% parameter reduction (3.01M to 2.40M) and 12.3% GFLOPs reduction with 0.4 pp mAP@50 drop; QYOLOv8s shows 21.8% reduction with 0.1 pp drop. Knowledge distillation recovers full accuracy parity while preserving the compression.

Significance. If the shared sinusoidal parameters prove effective for recalibrating channels at both widths, the work would demonstrate a practical route to architectural compression in single-stage detectors by avoiding per-stage parameter duplication. The concrete deltas on a public benchmark, the option to restore accuracy via distillation, and the focus on backbone-only changes are strengths that could inform efficient real-time detection designs. However, the lack of ablations isolating the shared-parameter mechanism limits attribution of gains specifically to the quantum-inspired component.

major comments (2)

[QMixBlock description] Methods section describing the QMixBlock: the central claim that a single shared set of learnable sinusoidal parameters suffices for effective global recalibration at both 512-channel and 1024-channel stages lacks any supporting equation, pseudocode, or analysis showing how the same frequency/phase/amplitude values adapt when input dimensionality doubles. This is load-bearing for the efficiency claim, as the 20% parameter cut and sub-0.5 pp mAP degradation are attributed to the shared design rather than simple module removal.
[Evaluation on VisDrone2019] Experimental results and ablation studies: no experiments compare the shared-parameter QMixBlock against variants with stage-specific parameters or against a non-sinusoidal lightweight replacement for the C2f modules. Without such controls, the reported 0.4 pp and 0.1 pp mAP@50 degradations cannot be confidently linked to the quantum-inspired shared mixing rather than the particular VisDrone split or training details.

minor comments (2)

[Introduction] The abstract and introduction refer to 'quantum-inspired' sinusoidal mixing but provide no explicit connection or reference to quantum computing concepts that motivate the sinusoidal form.
[Experiments] Training protocol, optimizer settings, and data augmentation details are not specified, which hinders exact reproduction of the reported mAP and efficiency numbers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment below and commit to revisions that improve the clarity of the QMixBlock formulation and strengthen the experimental analysis.

Referee: [QMixBlock description] Methods section describing the QMixBlock: the central claim that a single shared set of learnable sinusoidal parameters suffices for effective global recalibration at both 512-channel and 1024-channel stages lacks any supporting equation, pseudocode, or analysis showing how the same frequency/phase/amplitude values adapt when input dimensionality doubles. This is load-bearing for the efficiency claim, as the 20% parameter cut and sub-0.5 pp mAP degradation are attributed to the shared design rather than simple module removal.

Authors: We agree that the original methods section presented the QMixBlock at an insufficient level of detail. In the revised manuscript we will add the explicit mathematical formulation, pseudocode, and adaptation analysis. The shared parameters consist of a compact set of learnable frequencies ω, phases φ, and amplitudes a (K=16). For an input with C channels the channel descriptor d is obtained by global average pooling; the recalibration weights are then r_c = sigmoid(∑_k a_k ⋅ sin(ω_k ⋅ (c/C) + φ_k)) for channel index c. Normalizing the argument by C makes the same parameter set operate on both 512- and 1024-channel stages without modification. We will include the corresponding pseudocode and a short paragraph explaining why this dimension-agnostic sinusoidal mixing avoids per-stage duplication while preserving recalibration effectiveness. These additions will appear in Section 3.2. revision: yes
Referee: [Evaluation on VisDrone2019] Experimental results and ablation studies: no experiments compare the shared-parameter QMixBlock against variants with stage-specific parameters or against a non-sinusoidal lightweight replacement for the C2f modules. Without such controls, the reported 0.4 pp and 0.1 pp mAP@50 degradations cannot be confidently linked to the quantum-inspired shared mixing rather than the particular VisDrone split or training details.

Authors: The referee is correct that the submitted version lacks these targeted controls. We will add an ablation subsection that directly compares three variants under identical training protocols on VisDrone2019: (i) the proposed shared sinusoidal QMixBlock, (ii) an otherwise identical block with independent parameters per stage, and (iii) a non-sinusoidal lightweight replacement using a single linear projection. The new results will quantify parameter counts, GFLOPs, and mAP@50 for each, allowing readers to attribute the observed compression and accuracy retention specifically to the shared quantum-inspired mechanism rather than training details or dataset split. These experiments will be reported in the main text or supplementary material as space permits. revision: yes
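
The recalibration rule quoted in the first response, r_c = sigmoid(∑_k a_k ⋅ sin(ω_k ⋅ (c/C) + φ_k)), can be sketched in a few lines. The code below is a minimal illustration under stated assumptions, not the authors' implementation: the initial values of a, ω, and φ are arbitrary, the helper names are hypothetical, and the pooled descriptor d is omitted because the quoted formula for r_c does not use it.

```python
import math
import random

K = 16  # number of shared sinusoidal components, as stated in the rebuttal


def make_shared_params(k=K, seed=0):
    """Arbitrary initial values; the source does not describe initialisation."""
    rng = random.Random(seed)
    amplitudes = [rng.uniform(-1.0, 1.0) for _ in range(k)]
    frequencies = [rng.uniform(0.0, 2.0 * math.pi * k) for _ in range(k)]
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(k)]
    return amplitudes, frequencies, phases


def recalibration_weights(num_channels, params):
    """r_c = sigmoid(sum_k a_k * sin(omega_k * c / C + phi_k)) for c in [0, C)."""
    a, omega, phi = params
    weights = []
    for c in range(num_channels):
        position = c / num_channels  # normalised channel index in [0, 1)
        s = sum(
            a_k * math.sin(w_k * position + p_k)
            for a_k, w_k, p_k in zip(a, omega, phi)
        )
        weights.append(1.0 / (1.0 + math.exp(-s)))
    return weights


# One parameter set serves both stage widths.
params = make_shared_params()
r_512 = recalibration_weights(512, params)
r_1024 = recalibration_weights(1024, params)

# Total shared values: K amplitudes + K frequencies + K phases.
num_shared = sum(len(p) for p in params)

print(len(r_512), len(r_1024), num_shared)
```

Because the argument is c/C, channel 2c of the 1024-channel stage receives exactly the weight of channel c in the 512-channel stage: the wider stage samples the same learned curve at twice the density. The shared set holds 3K = 48 values, whereas independent per-stage sets of the same form would hold 96, which is the comparison the proposed ablation (ii) would make.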

Circularity Check

0 steps flagged

No circularity; empirical claims rest on external benchmark evaluation

The paper introduces QMixBlock as a replacement for the deepest C2f modules, using a sinusoidal mixing mechanism with one shared set of learnable parameters across the 512- and 1024-channel stages. All reported gains (20.2% parameter reduction, 12.3% GFLOPs reduction, <0.5 pp mAP@50 change) are obtained by direct measurement against unmodified YOLOv8 baselines on the external VisDrone2019 dataset; no first-principles derivation, fitted-parameter prediction, or self-citation chain is invoked to establish the result. The design choice of shared parameters is presented as an engineering decision whose validity is tested empirically rather than assumed by construction or prior self-work.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The claim rests on the unproven effectiveness of the sinusoidal mixing mechanism and the sufficiency of shared parameters; no independent evidence or formal derivation is supplied beyond the empirical VisDrone2019 results.

free parameters (1)

shared learnable parameters for sinusoidal mixing
Learnable weights in the QMixBlock that are fitted during training to enable channel recalibration.

axioms (1)

domain assumption: Sinusoidal functions with shared parameters can replace independent C2f bottleneck processing for global channel importance
Invoked in the design of the QMixBlock to justify architectural compression.

invented entities (1)

QMixBlock (no independent evidence)
purpose: Compact quantum-inspired replacement for deep C2f modules using shared sinusoidal channel mixing
New architectural component introduced to achieve the reported compression; no independent evidence outside the VisDrone experiments.

pith-pipeline@v0.9.0 · 5571 in / 1354 out tokens · 59931 ms · 2026-05-07T11:10:11.994662+00:00 · methodology

Reference graph

Works this paper leans on

18 extracted references · 18 canonical work pages · 1 internal anchor

[1] G. Jocher, A. Chaurasia, and J. Qiu, “Ultralytics YOLOv8,” 2023. [Online]. Available: https://github.com/ultralytics/ultralytics

[2] S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding,” in Proceedings of the International Conference on Learning Representations (ICLR), San Juan, Puerto Rico, 2016.

[3] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf, “Pruning filters for efficient ConvNets,” in Proceedings of the International Conference on Learning Representations (ICLR), Toulon, France, 2017.

[4] M. Sabaghian, H. Marvi, and S. M. R. Hashemi, “A compression framework for YOLOv8 enabling real-time aerial object detection on edge devices through structured pruning,” arXiv preprint arXiv:2509.12918, 2025.

[5] G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” arXiv preprint arXiv:1503.02531, 2015.

[6] J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 2018, pp. 7132–7141.

[7] Q. Wang, B. Wu, P. Zhu, P. Li, W. Zuo, and Q. Hu, “ECA-Net: Efficient channel attention for deep convolutional neural networks,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 2020, pp. 11531–11539.

[8] K. Mitarai, M. Negoro, M. Kitagawa, and K. Fujii, “Quantum circuit learning,” Physical Review A, vol. 98, no. 3, p. 032309, 2018.

[9] M. Schuld, I. Sinayskiy, and F. Petruccione, “The quest for a quantum neural network,” Quantum Information Processing, vol. 13, no. 11, pp. 2567–2586, 2014.

[10] V. Sitzmann, J. N. P. Martel, A. W. Bergman, D. B. Lindell, and G. Wetzstein, “Implicit neural representations with periodic activation functions,” in Advances in Neural Information Processing Systems (NeurIPS), vol. 33, 2020, pp. 7462–7473.

[11] P. Zhu, L. Wen, D. Du, X. Bian, H. Fan, and Q. Hu, “VisDrone-DET2019: The vision meets drone object detection in image challenge results,” in Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), Seoul, South Korea, 2019.

[12] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, “Feature pyramid networks for object detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 2017, pp. 2117–2125.

[13] S. Liu, L. Qi, H. Qin, J. Shi, and J. Jia, “Path aggregation network for instance segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 2018, pp. 8759–8768.

[14] N. Rahaman, A. Baratin, D. Arpit, F. Draxler, M. Lin, F. A. Hamprecht, Y. Bengio, and A. Courville, “On the spectral bias of neural networks,” in Proceedings of the 36th International Conference on Machine Learning (ICML), Long Beach, CA, USA, 2019, pp. 5301–5310.

[15] P. Zhang, Y. Zhong, and X. Li, “SlimYOLOv3: Narrower, faster and better for real-time UAV applications,” in Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), Seoul, South Korea, 2019, pp. 37–45.

[16] X. Zhao, Z. Yang, and H. Zhao, “DCS-YOLOv8: A lightweight context-aware network for small object detection in UAV remote sensing imagery,” Remote Sensing, vol. 17, no. 17, p. 2989, 2025, doi: 10.3390/rs17172989.

[17] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” in Advances in Neural Information Processing Systems (NeurIPS), vol. 28, 2015.

[18] J. Kang, H. Yang, and H. Kim, “Simplifying two-stage object detectors for on-board remote sensing,” IEEE Access, vol. 13, pp. 145703–145713, 2025.
