arxiv: 2605.21308 · v1 · pith:WI2N5DOYnew · submitted 2026-05-20 · 💻 cs.CV · cs.AI

Deformba: Vision State Space Model with Adaptive State Fusion

Pith reviewed 2026-05-21 05:00 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords state space models · vision state space models · adaptive state fusion · multi-modal fusion · BEV perception · image classification · object detection

The pith

Deformba introduces adaptive state fusion to let vision state space models handle dynamic spatial structures and multi-modal queries while keeping linear complexity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Deformba to fix two problems with existing vision state space models: they rely on fixed scanning patterns that lock in rigid image geometries, and they struggle with cross-stream fusion because of their causal, self-contained design. The proposed context-adaptive fusion dynamically augments spatial information and supports query-based interactions similar to cross-attention. Experiments demonstrate competitive results on image classification, object detection, segmentation, and BEV 3D perception. A reader would care if this combination of flexibility and linear scaling makes efficient sequence models viable for more demanding visual tasks.

Core claim

Deformba is a context-adaptive method that dynamically augments spatial structural information while maintaining the linear complexity of SSMs and allowing multi-modal fusion in the manner of cross-attention, enabling strong performance on both 2D vision benchmarks and 3D BEV perception tasks.

What carries the argument

Adaptive state fusion, a context-driven mechanism that augments spatial structure on the fly and permits query-based multi-modal interactions without breaking linear-time scaling.
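
To make the sampling idea concrete, here is a minimal Python sketch of one way a context-adaptive fusion step could look. It rests on several assumptions that are not taken from the paper: the hidden-state map is a 2D grid of scalars, every position is handed K fractional offsets and K fusion weights (in the paper these would presumably come from a learned offset predictor, which is not modeled here), and samples are read by bilinear interpolation. The names bilinear_sample and fuse_states are hypothetical, and this is not the authors' implementation.

```python
import math


def bilinear_sample(state_map, y, x):
    """Read a 2D grid of scalars at fractional (y, x), clamping to the border."""
    h, w = len(state_map), len(state_map[0])
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    wy, wx = y - y0, x - x0
    top = (1 - wx) * state_map[y0][x0] + wx * state_map[y0][x1]
    bottom = (1 - wx) * state_map[y1][x0] + wx * state_map[y1][x1]
    return (1 - wy) * top + wy * bottom


def fuse_states(state_map, offsets, weights):
    """Add a weighted sum of K offset samples to the state at every position.

    offsets[i][j] is a list of K (dy, dx) pairs and weights[i][j] a list of K
    scalars. Both are plain inputs here (assumption); a real model would
    predict them from local context.
    """
    h, w = len(state_map), len(state_map[0])
    fused = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = state_map[i][j]
            for (dy, dx), wt in zip(offsets[i][j], weights[i][j]):
                acc += wt * bilinear_sample(state_map, i + dy, j + dx)
            fused[i][j] = acc
    return fused


if __name__ == "__main__":
    grid = [[0.0, 1.0], [2.0, 3.0]]
    offsets = [[[(0.5, 0.5)], [(0.0, 0.0)]], [[(0.0, 0.0)], [(0.0, 0.0)]]]
    weights = [[[1.0], [0.0]], [[0.0], [0.0]]]
    print(fuse_states(grid, offsets, weights))
```

Each position touches exactly K samples, so this sketch costs O(H·W·K) operations, linear in the number of positions for fixed K. Whether the paper's mechanism has the same property depends on how its offsets and weights are actually produced.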

If this is right

  • Vision SSMs can process images without manually designed fixed scanning orders.
  • Multi-view 3D fusion becomes feasible within a linear-complexity SSM framework.
  • The same architecture can be applied to both 2D perception and BEV tasks without redesign.
  • Linear scaling is preserved even when spatial structure is augmented dynamically.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The fusion approach may extend to other sequence domains where cross-stream queries are needed.
  • It could reduce reliance on attention layers in multi-modal models if the linear-cost claim holds at scale.
  • Further tests on video or point-cloud sequences would check whether the adaptive mechanism generalizes beyond the reported benchmarks.

Load-bearing premise

The adaptive state fusion can be realized in practice without introducing hidden quadratic costs or needing heavy task-specific tuning.

What would settle it

Runtime or memory measurements on high-resolution images showing quadratic scaling, or benchmark scores that fall below comparable fixed-scan SSM baselines.
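
One simple way to run such a check is to fit the slope of log(cost) against log(sequence length): a slope near 1 indicates linear scaling and a slope near 2 indicates quadratic scaling. The sketch below is a generic illustration with synthetic numbers, not measurements from the paper; scaling_exponent and the token counts are hypothetical choices.

```python
import math


def scaling_exponent(sizes, costs):
    """Least-squares slope of log(cost) versus log(size)."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(c) for c in costs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


if __name__ == "__main__":
    # Hypothetical token counts L = H * W as input resolution doubles.
    token_counts = [196, 784, 3136, 12544]
    # Synthetic costs standing in for measured runtime or peak memory.
    linear_costs = [2.0 * n for n in token_counts]
    quadratic_costs = [1e-3 * n * n for n in token_counts]
    print("linear model exponent:", scaling_exponent(token_counts, linear_costs))
    print("quadratic model exponent:", scaling_exponent(token_counts, quadratic_costs))
```

With real measurements, fixed overheads such as kernel launches pull the fitted slope below the true exponent at small sizes, so the fit is most informative on the largest resolutions that fit in memory.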

Figures

Figures reproduced from arXiv: 2605.21308 by Haoxin Wang, Hongyu Ke, Jack Morris, Kentaro Oguchi, Satoshi Kitai, Yi Ding, Yongkang Liu.

Figure 1: The trade-off between ImageNet-1K top-1 accuracy and inference throughput. Throughput is measured on an NVIDIA RTX 6000 Ada GPU with a batch size of 128. At matched throughput Deformba reaches higher accuracy, and at matched accuracy higher throughput, than the other methods. […] gained attention as compelling alternatives to Transformers in domains tradi…
Figure 2: Context-Adaptive State Fusion (CASF). Instead of relying on multiple predefined scanning methods (e.g., sweeping/continuous/local), CASF decouples the SSM write/read and inserts an offset predictor to sample context-adaptive evidence from the hidden state map and fuse it into the current state without additional scanning. […] and sequential computation, they remain constrained by fixed scanning paths that a s…
Figure 3: Deformba architecture and block designs. The network adopts a four-stage hierarchical backbone with downsampling between stages and stacks Deformba blocks; each block consists of a Context-Adaptive SSM (with CASF) and a ConvFFN under residual connections. […] space model (S6). As with other linear attention methods (Katharopoulos et al., 2020; Schlag et al., 2021; Yang et al., 2024b), Mamba employs input-depe…
Figure 4: Deformba Cross Attention. To enhance spatial understanding under Mamba’s causal constraints, each query samples relevant image features using learned offsets. This process enables global spatial interaction without expensive scans. When emulating cross-attention, the state equation above is used to compress information from the key and value input, and the query input sequence decodes the final state S_T. …
Figure 5: Illustration of BEV architecture. In the general problem setup, we take a set of input image feature maps F and a set of queries B. The module learns a function f that selects information from F to update B under supervised learning. This is done in three main steps: 1) construction of hidden states of F, 2) construction of hidden states of B, and 3) decoding of hidden sta…
Figure 6: Visualization of the last Deformba block in Stage 4 using Deformba-T. For each example, we show the input image (left) and the corresponding activation/attribution map (right) from the final block of Stage 4. The model consistently highlights semantically discriminative regions (e.g., object head/body or distinctive textures) while suppressing background, suggesting effective context aggregation by CASF […]
Figure 7: Visualization results of Deformba-Small for 3D bounding-box predictions on the nuScenes val set. […] materialization of an L × L attention matrix, the IO memory requirement remains linear O(L). Furthermore, the offset generation requires storing offset maps of size (B, 2G, H, W), where the number of groups G is smaller than the channel dimension C, ensuring that the memory overhead for the adaptive sampling structure re…
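
The memory remark attached to Figure 7 can be checked with simple arithmetic. The sketch below counts elements in an offset map of shape (B, 2G, H, W) and in a single L × L attention matrix per batch item with L = H·W. The group count G = 4 and the resolutions are hypothetical values chosen for illustration, and attention heads are ignored.

```python
def offset_map_elements(batch, groups, height, width):
    """Elements in an offset map of shape (B, 2G, H, W)."""
    return batch * 2 * groups * height * width


def attention_matrix_elements(batch, height, width):
    """Elements in one L x L attention matrix per batch item, L = H * W."""
    seq_len = height * width
    return batch * seq_len * seq_len


if __name__ == "__main__":
    groups = 4  # hypothetical; the caption only says G is smaller than C
    for side in (56, 112):
        offsets = offset_map_elements(1, groups, side, side)
        attention = attention_matrix_elements(1, side, side)
        print(side, offsets, attention)
```

Doubling the side length from 56 to 112 multiplies the offset map by 4 (25,088 to 100,352 elements) and the attention matrix by 16 (9,834,496 to 157,351,936 elements), which is the linear versus quadratic gap in L that the caption describes.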
Original abstract

State Space Models (SSMs) have emerged as a powerful and efficient alternative to Transformers, demonstrating linear-time complexity and exceptional sequence modeling capabilities. However, their application to vision tasks remains challenging. First, existing vision SSMs largely depend on manually designed fixed scanning methods to flatten image patches into sequences, which imposes predefined geometric structures and increases the complexity. Second, the broader adoption of vision SSMs is hindered in domains that require query-based interactions between distinct information streams. This is a result of the inherently causal and self-referential nature of SSMs designed for 1D sequence modeling tasks. This fusion mechanism is indispensable for critical perception tasks such as multi-view 3D fusion. To address these limitations, we propose Deformba, a context adaptive method that dynamically augments the spatial structural information while maintaining the linear complexity of SSMs. Deformba also allows multi-modal fusion like cross attention. To demonstrate the effectiveness and general applicability of Deformba, we test its performance on general 2D vision tasks such as image classification, object detection, and segmentation, as well as 3D vision tasks like BEV perception. Extensive experiments show that Deformba achieves strong performance across various visual perception benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes Deformba, a vision State Space Model that introduces a context-adaptive state fusion mechanism to dynamically augment spatial structural information in visual inputs. It claims to preserve the linear complexity of SSMs while enabling multi-modal fusion comparable to cross-attention, addressing limitations of fixed scanning in prior vision SSMs and their causal/self-referential nature. The approach is evaluated on 2D tasks (image classification, object detection, segmentation) and 3D tasks (BEV perception), with claims of strong performance across benchmarks.

Significance. If the adaptive fusion mechanism rigorously maintains strict linear complexity without hidden super-linear costs from context-dependent operations or cross-stream interactions, the work would offer a meaningful advance in efficient vision backbones. It could enable broader use of SSMs in multi-view 3D and multi-modal perception settings where query-based fusion is required, providing an alternative to attention-based models with better scaling properties.

major comments (2)
  1. [Abstract] Abstract and method description: The central claims of performance gains, linear complexity preservation, and effective multi-modal fusion are stated without any equations, ablation studies, error bars, dataset details, or complexity derivations. This prevents verification of whether the context-adaptive augmentation and query-based fusion avoid quadratic costs.
  2. [Method] Method section (adaptive state fusion): The replacement of fixed scanning with a learned, context-dependent process and the addition of cross-stream query interactions are presented as preserving O(N) complexity. However, no operation count, recurrence formulation, or proof is given to confirm that state-dependent weights, deformable offsets, or pairwise relations are computed strictly locally or recurrently rather than via global aggregation.
minor comments (1)
  1. [Abstract] The abstract would benefit from a brief mention of the specific benchmarks and baseline comparisons used to support the 'strong performance' claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We agree that greater detail on equations, ablations, error bars, and complexity derivations would improve verifiability. We address each major comment below and will revise the manuscript accordingly to strengthen these aspects while preserving the core technical claims.

Point-by-point responses
  1. Referee: [Abstract] Abstract and method description: The central claims of performance gains, linear complexity preservation, and effective multi-modal fusion are stated without any equations, ablation studies, error bars, dataset details, or complexity derivations. This prevents verification of whether the context-adaptive augmentation and query-based fusion avoid quadratic costs.

    Authors: We acknowledge that the abstract is high-level by design. The full manuscript already contains the adaptive state fusion equations in Section 3, ablation studies in Section 4.3, and benchmark details in Section 4.1. To directly address the verification concern, we will add a dedicated complexity analysis subsection with explicit operation counts, a recurrence formulation, and a short proof sketch demonstrating that context-dependent weights and deformable offsets are computed locally and recurrently. We will also include error bars on all reported metrics and expand dataset descriptions. These additions will confirm that no hidden quadratic terms arise from the adaptive or cross-stream components. revision: yes

  2. Referee: [Method] Method section (adaptive state fusion): The replacement of fixed scanning with a learned, context-dependent process and the addition of cross-stream query interactions are presented as preserving O(N) complexity. However, no operation count, recurrence formulation, or proof is given to confirm that state-dependent weights, deformable offsets, or pairwise relations are computed strictly locally or recurrently rather than via global aggregation.

    Authors: The mechanism computes deformable offsets and state-dependent fusion weights via lightweight local operators (small MLPs or convolutions on neighboring features) that feed directly into the standard SSM recurrence; cross-stream queries are likewise folded into the linear state update without global pairwise computation. We agree an explicit derivation was omitted. In revision we will insert the full recurrence equations, an operation-count table, and a brief argument showing that all additional terms scale linearly because they remain position-local and recurrent rather than requiring global aggregation. revision: yes
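
The idea that cross-stream queries can be folded into a linear state update can be illustrated with a small sketch. It assumes the simplest possible state equation, S_t = decay · S_{t-1} + k_t v_t^T with a scalar decay, which is an assumption made here for illustration and not necessarily the paper's recurrence. The keys and values are compressed into one d_k × d_v state, and each query then decodes that final state. The function names are hypothetical.

```python
def compress(keys, values, decay=1.0):
    """Fold key/value pairs into a state: S_t = decay * S_{t-1} + k_t v_t^T."""
    d_k, d_v = len(keys[0]), len(values[0])
    state = [[0.0] * d_v for _ in range(d_k)]
    for k, v in zip(keys, values):
        for i in range(d_k):
            for j in range(d_v):
                state[i][j] = decay * state[i][j] + k[i] * v[j]
    return state


def decode(queries, state):
    """Each query reads the final state: y = q^T S_T."""
    d_v = len(state[0])
    return [
        [sum(q[i] * state[i][j] for i in range(len(q))) for j in range(d_v)]
        for q in queries
    ]


def pairwise_reference(queries, keys, values):
    """Quadratic reference for decay = 1: y = sum_t (q . k_t) v_t."""
    d_v = len(values[0])
    outputs = []
    for q in queries:
        out = [0.0] * d_v
        for k, v in zip(keys, values):
            score = sum(qi * ki for qi, ki in zip(q, k))
            for j in range(d_v):
                out[j] += score * v[j]
        outputs.append(out)
    return outputs


if __name__ == "__main__":
    keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    queries = [[1.0, 0.0], [0.5, 0.5]]
    print(decode(queries, compress(keys, values)))
    print(pairwise_reference(queries, keys, values))
```

With T key/value pairs and M queries, the state-based path costs O((T + M)·d_k·d_v), while the pairwise reference costs O(T·M·(d_k + d_v)). The two agree when decay is 1, which shows that no T × M score matrix is needed in this simplified setting. An operation-count table for the actual method would still have to come from the authors.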

Circularity Check

0 steps flagged

No significant circularity; method introduced as independent architectural proposal

Full rationale

The paper presents Deformba as a new context-adaptive state fusion mechanism for vision SSMs that augments spatial structure while preserving linear complexity and enabling multi-modal fusion. No equations or derivations in the provided abstract reduce a claimed prediction or result back to a fitted parameter or self-citation by construction. The central claims rest on the proposed architecture's design choices rather than re-deriving quantities from prior fitted results or self-referential definitions. External benchmarks and experiments are invoked to demonstrate effectiveness, so the claims are tested against independent evidence rather than against the paper's own definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities. The central claim rests on the unstated premise that the adaptive fusion can be implemented efficiently and generalizes across the listed tasks.

pith-pipeline@v0.9.0 · 5761 in / 1063 out tokens · 28389 ms · 2026-05-21T05:00:02.987043+00:00 · methodology

Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages · 6 internal anchors

  1. [1]

    MMDetection: Open MMLab Detection Toolbox and Benchmark

    Chen, K., Wang, J., Pang, J., Cao, Y., Xiong, Y., Li, X., Sun, S., Feng, W., Liu, Z., Xu, J., et al. MMDetection: Open MMLab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155,

  2. [2]

    Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality

    Dao, T. and Gu, A. Transformers are SSMs: Generalized models and efficient algorithms through structured state space duality. arXiv preprint arXiv:2405.21060,

  3. [3]

    Mamba: Linear-Time Sequence Modeling with Selective State Spaces

    Gu, A. and Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752,

  4. [4]

    Efficiently Modeling Long Sequences with Structured State Spaces

    Gu, A., Goel, K., and Ré, C. Efficiently modeling long sequences with structured state spaces. arXiv preprint arXiv:2111.00396, 2021a.
    Gu, A., Johnson, I., Goel, K., Saab, K., Dao, T., Rudra, A., and Ré, C. Combining recurrent, convolutional, and continuous-time models with linear state space layers. Advances in neural information processing systems, 34...

  5. [5]

    Damamba: Vision State Space Model with Dynamic Adaptive Scan

    Li, T., Li, C., Lyu, J., Pei, H., Zhang, B., Jin, T., and Ji, R. Damamba: Vision state space model with dynamic adaptive scan. arXiv preprint arXiv:2502.12627,

  6. [6]

    Bevformer: Learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers

    Li, Z., Wang, W., Li, H., Xie, E., Sima, C., Lu, T., Qiao, Y., and Dai, J. Bevformer: Learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers. arXiv preprint arXiv:2203.17270, 2022a.
    Li, Z., Wang, W., Xie, E., Yu, Z., Anandkumar, A., Alvarez, J. M.,...

  7. [7]

    Spatial-mamba: Effective visual state space models via structure-aware state fusion

    Wang, W., Xie, E., Li, X., Fan, D.-P., Song, K., Liang, D., Lu, T., Luo, P., and Shao, L. Pvt v2: Improved baselines with pyramid vision transformer. Computational visual media, 8(3):415–424, 2022a.
    Wang, Y., Guizilini, V. C., Zhang, T., Wang, Y., Zhao, H., and Solomon, J. Detr3d: 3d objec...

  8. [8]

    Grootvl: Tree topology is all you need in state space model

    Xiao, Y., Song, L., Huang, S., Wang, J., Song, S., Ge, Y., Li, X., and Shan, Y. Grootvl: Tree topology is all you need in state space model. arXiv preprint arXiv:2406.02395, 2024b.
    Xu, B., Dai, X., Tang, D., and Zhang, K. One surrogate to fool them all: Universal, transferable, and targeted adversarial attacks with clip. In Proceedings of the 2025 ACM ...

  9. [9]

    Yang, C., Chen, Z., Espinosa, M., Ericsson, L., Wang, Z., Liu, J., and Crowley, E. J. Plainmamba: Improving non-hierarchical mamba in visual recognition. arXiv preprint arXiv:2403.17695, 2024a.
    Yang, S., Wang, B., Zhang, Y., Shen, Y., and Kim, Y. Parallelizing linear transformers with the delta rule over sequence length. Advances in neural informati...

  10. [10]

    MedSynapse-V: Bridging Visual Perception and Clinical Intuition via Latent Memory Evolution

    Zhu, C., Lin, Y., Chen, S., Wang, Y., and Lin, J. Medeyes: Learning dynamic visual focus for medical progressive diagnosis. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pp. 13916–13924, 2026a.
    Zhu, C., Zeng, J., Jiang, J., Lin, J., and Wang, Y. Medsynapse-v: Bridging visual perception and clinical intuition via latent m...

  11. [11]

    Deformable DETR: Deformable Transformers for End-to-End Object Detection

    Zhu, X., Su, W., Lu, L., Li, B., Wang, X., and Dai, J. Deformable DETR: Deformable transformers for end-to-end object detection. arXiv preprint arXiv:2010.04159,

  12. [12]

    The encoding function f_enc is parameterized by the ResNet backbone, feature pyramid network (FPN), BEV encoder, and temporal fusion modules

    Many of the existing components follow the methodologies of (Li et al., 2022a), (Liu et al., 2023), and (Ke et al., 2025b). The encoding function f_enc is parameterized by the ResNet backbone, feature pyramid network (FPN), BEV encoder, and temporal fusion modules. The pipeline has three main stages as follows. First, image feature maps of different scales...

  13. [13]

    We train the models from scratch using a randomly initialized network for the encoder layers

    A 0.1 multiplier is applied to the learning rate of the backbone weights and the deformable attention sampling offsets (Zhu et al., 2020). We train the models from scratch using a randomly initialized network for the encoder layers. For experiments on the COCO dataset, we trained each decoder configuration using 3 decoder layers for 5 epochs at a learning...