Pith · machine review for the scientific record

arxiv: 2604.08038 · v1 · submitted 2026-04-09 · 💻 cs.CV

Recognition: 1 theorem link · Lean Theorem

Beyond Mamba: Enhancing State-space Models with Deformable Dilated Convolutions for Multi-scale Traffic Object Detection

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 18:18 UTC · model grok-4.3

classification 💻 cs.CV
keywords traffic object detection · state-space models · Mamba · deformable dilated convolutions · multi-scale feature fusion · hierarchical representation · computer vision

The pith

A hybrid backbone interleaving multi-scale deformable dilated convolutions with Mamba blocks improves detection of varying-size objects in cluttered traffic scenes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that Mamba-based detectors, effective for long-range dependencies, fall short when traffic scenes contain small objects with rich local details and require strong hierarchical cross-scale interactions. It proposes MDDCNet, whose backbone alternates Multi-Scale Deformable Dilated Convolution blocks with Mamba blocks to build features progressively from local structures to global semantics. Additional modules strengthen channel mixing and multi-scale aggregation. Experiments on standard benchmarks and real-world traffic data are presented as evidence that the resulting detector outperforms prior advanced methods. If the approach holds, state-space models can be made practically viable for vision tasks that demand both fine spatial detail and broad context.

Core claim

The central claim is that a backbone of successive Multi-Scale Deformable Dilated Convolution blocks interleaved with Mamba blocks, augmented by a Channel-Enhanced Feed-Forward Network and a Mamba-based Attention-Aggregating Feature Pyramid Network, produces hierarchical feature representations that capture both local details and cross-scale interactions. On the paper's account, this overcomes the limitations of flat sequential modeling in Mamba and yields accurate multi-scale traffic object detection.

What carries the argument

The Multi-Scale Deformable Dilated Convolution block, which supplies local detail extraction and spatial inductive biases when placed in alternation with Mamba blocks inside the backbone.
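The interleaving pattern that carries the argument can be sketched at toy scale. The following is a minimal 1-D illustration, not the paper's code: a fixed dilated filter stands in for an MSDDC block and a first-order linear scan stands in for a Mamba block; all shapes, constants, and function names are illustrative assumptions.

```python
import numpy as np

def dilated_conv1d(x, w, dilation):
    # "same"-padded 1-D dilated sliding-window filter: local-detail mixing,
    # a stand-in for an MSDDC block (the real blocks are 2-D and deformable)
    k = len(w)
    pad = dilation * (k // 2)
    xp = np.pad(x, pad)
    return np.array([sum(w[j] * xp[i + j * dilation] for j in range(k))
                     for i in range(len(x))])

def ssm_scan(x, a=0.9):
    # first-order linear recurrence h[t] = a*h[t-1] + x[t]:
    # global-context mixing, a stand-in for a Mamba block
    h, out = 0.0, []
    for v in x:
        h = a * h + v
        out.append(h)
    return np.array(out)

def hybrid_backbone(x, depth=3):
    # alternate local and global mixing, doubling the dilation per stage,
    # mirroring the MSDDC/Mamba interleaving at toy scale
    for stage in range(depth):
        x = dilated_conv1d(x, np.array([0.25, 0.5, 0.25]), dilation=2 ** stage)
        x = ssm_scan(x)
    return x

feat = hybrid_backbone(np.arange(8.0))
```

The point of the sketch is only the alternation: each stage first mixes a growing local neighborhood, then propagates context across the whole sequence.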

If this is right

  • The hybrid design yields measurable gains in detecting small and varying-scale objects against cluttered backgrounds.
  • Channel-enhanced feed-forward processing increases the model's ability to mix information across feature channels.
  • Mamba-based attention aggregation produces stronger multi-scale feature fusion than conventional pyramid networks.
  • The resulting detector records higher performance than various advanced methods on both benchmark and real-world traffic collections.
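The channel-mixing claim for the CE-FFN can be made concrete with a toy numpy rendering of a global-plus-local branch pair with residual connections. Identity and averaging ops here stand in for the paper's 1×1 and depthwise convolutions; the gate `r` and all shapes are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def ce_ffn_toy(x, r=0.5):
    # x: (channels, tokens) feature map
    y = gelu(x)                        # stand-in for the Conv1x1 -> DWConv3x3 -> GELU stem
    # global branch: per-channel statistics broadcast back over tokens
    f_global = np.broadcast_to(y.mean(axis=1, keepdims=True), y.shape)
    # local branch: residual modulation, cf. F_local = r * (Y - GELU(Conv(Y))) + Y
    f_local = r * (y - gelu(y)) + y
    return f_global + f_local + x      # final residual connection

out = ce_ffn_toy(np.random.default_rng(0).normal(size=(4, 6)))
```

The design intent being tested is that adding a channel-statistics branch on top of the token-local path increases cross-channel information flow relative to a vanilla FFN.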

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same interleaving pattern could be tested on other state-space vision backbones to see whether deformable dilated convolutions supply a general remedy for missing local inductive bias.
  • If the gains prove stable, traffic surveillance systems might adopt the architecture to reduce missed detections of distant or small vehicles.
  • The work leaves open whether similar convolutional enhancements would improve Mamba models on non-traffic detection tasks that also involve scale variation.

Load-bearing premise

The specific interleaving of Multi-Scale Deformable Dilated Convolution blocks with Mamba blocks, together with the two added modules, will reliably compensate for Mamba's weaknesses in local detail capture and hierarchical cross-scale interaction when applied to real traffic images.

What would settle it

The superiority claim would be falsified by re-running the exact same evaluation protocol on the public benchmark and real-world datasets and finding that MDDCNet achieves accuracy equal to or lower than strong Mamba-only baselines or other recent detectors.
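Any such re-run rests on the PR-curve summary the paper reports. A minimal sketch of the standard all-point interpolated average precision (the usual metric behind mAP; not necessarily the paper's exact protocol):

```python
import numpy as np

def average_precision(recall, precision):
    # all-point interpolated AP: take the precision envelope
    # and integrate it over recall
    r = np.concatenate(([0.0], np.asarray(recall, float), [1.0]))
    p = np.concatenate(([0.0], np.asarray(precision, float), [0.0]))
    for i in range(len(p) - 2, -1, -1):   # make precision non-increasing in recall
        p[i] = max(p[i], p[i + 1])
    step = np.where(r[1:] != r[:-1])[0]   # indices where recall increases
    return float(np.sum((r[step + 1] - r[step]) * p[step + 1]))

ap = average_precision([0.5, 1.0], [1.0, 1.0])   # a perfect toy detector
```

mAP is then the mean of per-class APs, so a like-for-like comparison also requires identical class lists and IoU thresholds across methods.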

Figures

Figures reproduced from arXiv: 2604.08038 by Jianhua Xu, Jun Li, Nan Guo, Yingying Shi, Zhixuan Ruan.

Figure 1. The MDDCNet architecture: Hybrid Backbone (a), Attention-Aggregating Feature Pyramid Network (A^2FPN) (b), and Detection Head (c).
Figure 2. The hybrid backbone network, consisting of successive MSDDC and Mamba blocks.
Figure 3. Comparison of the proposed CE-FFN (c) with the vanilla FFN (a) and an FFN with CA-Block (b).
Figure 4. The proposed Contextual-Spatial-Channel Attention (CSCA) synergy module: three complementary attention branches combined by an attention-aggregating mechanism to strengthen feature discriminative power for multi-scale fusion.
Figure 5. Statistics of the KITTI dataset: traffic object categories (a) and scale distributions (b).
Figure 6. Statistics of the RTOD dataset: traffic object categories (a) and scale distributions (b).
Figure 7. Example images from the RTOD dataset.
Figure 8. PR curves of YOLOv8n (a), YOLOv10n (b), YOLOv11n (c), YOLOv12n (d), YOLOv13n (e), and MDDCNet-T (f) on the KITTI dataset.
Figure 9. PR curves of YOLOv8n (a), YOLOv13n (b), MambaYOLO-T (c), and MDDCNet-T (d) on the RTOD dataset.
Figure 10. Qualitative comparison of YOLOv13n and MDDCNet on the KITTI dataset; MDDCNet shows fewer missed detections.
Figure 11. Qualitative comparison of YOLOv13n and MDDCNet on the RTOD dataset; MDDCNet yields fewer missed targets and lower false-positive rates.
read the original abstract

In a real-world traffic scenario, varying-scale objects are usually distributed in a cluttered background, which poses great challenges to accurate detection. Although current Mamba-based methods can efficiently model long-range dependencies, they still struggle to capture small objects with abundant local details, which hinders joint modeling of local structures and global semantics. Moreover, state-space models exhibit limited hierarchical feature representation and weak cross-scale interaction due to flat sequential modeling and insufficient spatial inductive biases, leading to sub-optimal performance in complex scenes. To address these issues, we propose a Mamba with Deformable Dilated Convolutions Network (MDDCNet) for accurate traffic object detection in this study. In MDDCNet, a well-designed hybrid backbone with successive Multi-Scale Deformable Dilated Convolution (MSDDC) blocks and Mamba blocks enables hierarchical feature representation from local details to global semantics. Meanwhile, a Channel-Enhanced Feed-Forward Network (CE-FFN) is further devised to overcome the limited channel interaction capability of conventional feed-forward networks, whilst a Mamba-based Attention-Aggregating Feature Pyramid Network (A^2FPN) is constructed to achieve enhanced multi-scale feature fusion and interaction. Extensive experimental results on public benchmark and real-world datasets demonstrate the superiority of our method over various advanced detectors. The code is available at https://github.com/Bettermea/MDDCNet.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper proposes MDDCNet, a hybrid architecture for multi-scale traffic object detection that combines successive Multi-Scale Deformable Dilated Convolution (MSDDC) blocks with Mamba blocks in the backbone to jointly model local details and global semantics. It further introduces a Channel-Enhanced Feed-Forward Network (CE-FFN) to improve channel interactions and a Mamba-based Attention-Aggregating Feature Pyramid Network (A^2FPN) for enhanced cross-scale feature fusion. The central claim, supported by the abstract, is that this design overcomes Mamba's limitations in local feature capture and hierarchical interactions, with extensive experiments on public benchmarks and real-world datasets demonstrating superiority over advanced detectors.

Significance. If the empirical gains are shown to arise specifically from the proposed modules rather than capacity or training differences, the work would offer a practical route to augmenting state-space models with convolutional inductive biases for detection in cluttered, multi-scale scenes. The public release of code is a clear strength that supports reproducibility and further investigation.

major comments (1)
  1. [Abstract and Experiments] The claim that MDDCNet demonstrates superiority by overcoming Mamba's local-detail and cross-scale limitations rests on the assumption that reported gains are attributable to the MSDDC blocks, CE-FFN, and A^2FPN. No capacity-matched ablations (e.g., removing MSDDC while holding total parameters fixed) or comparisons of parameter counts/FLOPs against the pure Mamba baseline and competing detectors are indicated, which is required to rule out increased model capacity as the source of improvement.
minor comments (1)
  1. [Abstract] The specific public benchmark and real-world datasets used are not named, which limits immediate assessment of the scope and difficulty of the evaluation.
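The capacity accounting the major comment asks for is mechanical. A sketch, with hypothetical layer shapes chosen only for illustration (not the paper's actual configuration):

```python
def conv2d_params(c_in, c_out, k, bias=True):
    # weight tensor c_out x c_in x k x k, plus optional bias
    return c_out * c_in * k * k + (c_out if bias else 0)

def conv2d_macs(c_in, c_out, k, h_out, w_out):
    # multiply-accumulates for a conv layer, given its output resolution
    return c_out * c_in * k * k * h_out * w_out

# hypothetical stem comparison: replacing a 1x1 mixing layer with a 3x3
# (dilated) layer multiplies its parameter budget roughly ninefold
baseline = conv2d_params(64, 64, 1)
variant  = conv2d_params(64, 64, 3)
```

A capacity-matched ablation would shrink channel widths elsewhere until the totals agree, so that any remaining gap can be credited to the module rather than to sheer parameter count.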

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive comments and the positive assessment of the work's potential contribution. The major concern regarding attribution of gains to the proposed modules versus model capacity is well-taken. We address it directly below and commit to revisions that strengthen the empirical claims.

read point-by-point responses
  1. Referee: [Abstract and Experiments] The claim that MDDCNet demonstrates superiority by overcoming Mamba's local-detail and cross-scale limitations rests on the assumption that reported gains are attributable to the MSDDC blocks, CE-FFN, and A^2FPN. No capacity-matched ablations (e.g., removing MSDDC while holding total parameters fixed) or comparisons of parameter counts/FLOPs against the pure Mamba baseline and competing detectors are indicated, which is required to rule out increased model capacity as the source of improvement.

    Authors: We agree that explicit controls for model capacity are necessary to substantiate the claims. The current manuscript does not report parameter counts, FLOPs, or capacity-matched ablations against the pure Mamba baseline. In the revised version we will add a dedicated table comparing parameter counts and FLOPs of MDDCNet, the Mamba baseline, and competing detectors. We will also include capacity-matched ablation experiments (e.g., scaling the baseline Mamba model to match MDDCNet's parameter budget) and report the resulting performance differences. These additions will be placed in the Experiments section and referenced in the abstract where appropriate. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical architecture proposal with no derivation chain

full rationale

The paper introduces MDDCNet as a hybrid backbone of MSDDC blocks and Mamba blocks plus CE-FFN and A^2FPN modules, motivated by stated limitations of prior Mamba methods in local detail and cross-scale modeling. All load-bearing claims rest on experimental comparisons on public and real-world datasets rather than any first-principles derivation, fitted-parameter prediction, or self-citation that reduces to the inputs by construction. No equations, uniqueness theorems, or ansatzes are invoked that could create circularity; the work is self-contained as an empirical engineering contribution.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The paper rests on standard deep-learning assumptions about feature hierarchies and the ability of hybrid local-global modules to improve detection; it introduces no new physical entities and treats architectural hyperparameters as design choices rather than fitted constants.

free parameters (2)
  • dilation rates and deformable sampling offsets
    These control the multi-scale receptive fields in MSDDC blocks and are either learned or manually set to capture varying object sizes.
  • number and placement of MSDDC and Mamba blocks
    Architectural depth and interleaving pattern chosen to balance local detail and global semantics.
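How dilation rates set the multi-scale receptive fields can be made concrete: for stacked stride-1 convolutions, each layer with kernel size k and dilation d adds (k − 1)·d to the receptive field. A sketch (the specific layer stack is illustrative, not the paper's):

```python
def receptive_field(layers):
    # receptive field of stacked stride-1 convolutions,
    # layers given as (kernel_size, dilation) pairs
    rf = 1
    for k, d in layers:
        rf += (k - 1) * d
    return rf

# three 3x3 layers with dilations 1, 2, 4 cover a 15-pixel extent
rf = receptive_field([(3, 1), (3, 2), (3, 4)])
```

This is why exponentially increasing dilations are a cheap way to span both small and large objects: coverage grows geometrically while parameter count grows only linearly in depth.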
axioms (2)
  • domain assumption: Mamba-based models have limited hierarchical feature representation and weak cross-scale interaction due to flat sequential modeling.
    Directly stated in the abstract as the motivation for the hybrid design.
  • domain assumption: Deformable dilated convolutions can supply the missing local inductive biases while preserving Mamba's long-range efficiency.
    Core premise underlying the MSDDC block and overall backbone.

pith-pipeline@v0.9.0 · 5565 in / 1466 out tokens · 85402 ms · 2026-05-10T18:18:07.752869+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

11 extracted references · 8 canonical work pages · 6 internal anchors

  1. [1] YOLOv4: Optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934.

  2. [2] Carion, N., Massa, F., Synnaeve, G., et al. End-to-end object detection with transformers, in: European Conference on Computer Vision (ECCV), pp. 213–229.

  3. [3] Hatamizadeh, A., Kautz, J., 2025. MambaVision: A hybrid Mamba-Transformer vision backbone, in: Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 25261–25270.

  4. [4] YOLOv11: An overview of the key architectural enhancements. arXiv preprint arXiv:2410.17725.

  5. [5] Lei, M., Li, S., Wu, Y., et al. YOLOv13: Real-time object detection with hypergraph-enhanced adaptive visual perception. arXiv preprint arXiv:2506.17733.

  6. [6] YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767.

  7. [7] YOLOv12: Attention-centric real-time object detectors. arXiv preprint arXiv:2502.12524.

  8. [8] Wang, C.Y., Yeh, I.H., Mark Liao, H.Y. YOLOv9: Learning what you want to learn using programmable gradient information, in: European Conference on Computer Vision, Springer, pp. 1–21.

  9. [9] PlainMamba: Improving non-hierarchical Mamba in visual recognition. arXiv preprint arXiv:2403.17695.

  10. [10] Vision Mamba: Efficient visual representation learning with bidirectional state space model. arXiv preprint arXiv:2401.09417.

  11. [11] Zhu, X., Su, W., Lu, L., et al. Deformable DETR: Deformable transformers for end-to-end object detection. arXiv preprint arXiv:2010.04159.