Dual Mamba for Node-Specific Representation Learning: Tackling Over-Smoothing with Selective State Space Modeling

Xin He; Xin Wang; Yili Wang; Yiwei Dai

arxiv: 2511.06756 · v3 · submitted 2025-11-10 · 💻 cs.LG

Pith reviewed 2026-05-17 23:57 UTC · model grok-4.3

classification 💻 cs.LG

keywords over-smoothing · graph neural networks · Mamba · state space models · node representation · deep GNNs · graph convolutional networks · selective state space modeling

The pith

A dual-Mamba graph network models node-specific state evolution locally while adding global context to keep representations distinct in deep layers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that over-smoothing, which is caused by repeated message passing, persists in deep GNNs because standard residual and skip connections do not track how each node's features change in a progressive, node-specific way from layer to layer. It introduces DMbaGCN, which augments graph convolution with two Mamba-based modules. LSEMba uses selective state-space modeling to aggregate neighborhood information while evolving a unique hidden state for every node across layers. GCAMba supplies each node with global graph context through the same selective mechanism. The combined structure is presented as a way to maintain discriminability so that deeper GNNs can be trained without representations collapsing.

Core claim

The central claim is that DMbaGCN, built from Local State-Evolution Mamba (LSEMba) for node-specific local dynamics and Global Context-Aware Mamba (GCAMba) for global information, enhances node discriminability in deep GNNs and thereby mitigates over-smoothing more effectively than residual connections or skip layers alone.

What carries the argument

The argument rests on the DMbaGCN framework, which pairs two modules: LSEMba, which applies selective state-space modeling to capture progressive, node-specific representation changes during local neighborhood aggregation, and GCAMba, which injects global context for each node.
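The paper's own equations are not reproduced on this page, so the following is only a minimal sketch of what selective state-space modeling along the layer axis could look like. It treats the per-layer features of each node as a sequence and runs a diagonal selective SSM over that sequence, independently per node. Every name here (`selective_scan_over_layers`, `W_delta`, `W_B`, `W_C`, the state size `S`) is hypothetical, and the simplified first-order discretization of `B` is an assumption rather than the authors' method.

```python
import numpy as np

def softplus(x):
    # Numerically stable softplus, keeps the step size positive.
    return np.logaddexp(0.0, x)

def selective_scan_over_layers(layer_feats, A, W_delta, W_B, W_C):
    """Diagonal selective SSM run along the layer axis, one state per node.

    layer_feats: (L, N, D) features of N nodes after each of L layers.
    A:           (D, S) negative real diagonal state matrix (hypothetical).
    W_delta:     (D, D) projection for the input-dependent step size.
    W_B, W_C:    (D, S) projections for the input-dependent B and C.
    Returns an (L, N, D) array of outputs.
    """
    num_layers, num_nodes, dim = layer_feats.shape
    state_size = A.shape[1]
    h = np.zeros((num_nodes, dim, state_size))   # node-specific hidden state
    outputs = np.empty_like(layer_feats)
    for l in range(num_layers):
        x = layer_feats[l]                        # (N, D)
        delta = softplus(x @ W_delta)             # (N, D), depends on the node
        B = x @ W_B                               # (N, S), depends on the node
        C = x @ W_C                               # (N, S), depends on the node
        A_bar = np.exp(delta[:, :, None] * A[None, :, :])   # (N, D, S)
        B_bar = delta[:, :, None] * B[:, None, :]           # (N, D, S), first-order
        h = A_bar * h + B_bar * x[:, :, None]     # state update per node
        outputs[l] = np.einsum("nds,ns->nd", h, C)
    return outputs
```

Because `delta`, `B`, and `C` are computed from each node's own features, two nodes with different inputs follow different state trajectories, which is the property the summary above attributes to LSEMba. In this sketch nodes only interact through whatever aggregation produced `layer_feats`.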

If this is right

Deep GNNs using the dual-Mamba structure maintain higher node discriminability than those relying solely on residual connections.
The selective state-space approach allows explicit modeling of how individual node representations evolve layer by layer.
Incorporating global context via GCAMba supplies information that local aggregation alone cannot provide, further reducing convergence of representations.
The resulting architecture demonstrates both effectiveness on node-level tasks and computational efficiency on standard graph benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same node-specific state tracking could be inserted into other message-passing architectures beyond GCNs, such as GAT or GraphSAGE variants.
If the method scales, it may allow reliable training of GNNs with dozens of layers on large graphs where current residual techniques still saturate.
A natural next measurement is whether the learned state transitions inside LSEMba correspond to interpretable structural roles of nodes.

Load-bearing premise

That Mamba's selective state-space modeling can be directly adapted to capture progressive, node-specific representation evolution across GNN layers and that adding global context will meaningfully outperform existing residual or skip-connection techniques.

What would settle it

Train a standard GCN, a residual GCN, and DMbaGCN to 20 or more layers on a fixed benchmark such as Cora or ogbn-arxiv, then measure the average cosine similarity or mutual information among node embeddings at the final layer; if the dual-Mamba version shows no clear reduction in similarity relative to the residual baseline, the mitigation claim is refuted.
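The similarity measurement in this test can be sketched in a few lines. The helper below is a hypothetical illustration (the name `mean_pairwise_cosine` is not from the paper) that computes the average cosine similarity over all distinct node pairs for a matrix of final-layer embeddings.

```python
import numpy as np

def mean_pairwise_cosine(embeddings, eps=1e-12):
    """Average cosine similarity over all distinct node pairs.

    embeddings: (N, D) array, one row per node, with N >= 2.
    """
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.maximum(norms, eps)   # guard against zero rows
    sim = unit @ unit.T                          # (N, N) cosine similarities
    n = embeddings.shape[0]
    off_diag_sum = sim.sum() - np.trace(sim)     # drop self-similarity
    return off_diag_sum / (n * (n - 1))
```

A value close to 1 means the embeddings have nearly collapsed onto one direction. Under the test described above, the same function would be applied to the final-layer embeddings of each of the three models at equal depth and the values compared.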

Figures

Figures reproduced from arXiv: 2511.06756 by Xin He, Xin Wang, Yili Wang, Yiwei Dai.

**Figure 2.** The Framework of DMbaGCN. LSEMba models the evolution of node representations across GNN layers using ...

**Figure 3.** Comparison of Time and Memory Consumption.

**Figure 4.** Effect of Hyperparameters α and β on Model Performance.

Original abstract

Over-smoothing remains a fundamental challenge in deep Graph Neural Networks (GNNs), where repeated message passing causes node representations to become indistinguishable. While existing solutions, such as residual connections and skip layers, alleviate this issue to some extent, they fail to explicitly model how node representations evolve in a node-specific and progressive manner across layers. Moreover, these methods do not take global information into account, which is also crucial for mitigating the over-smoothing problem. To address the aforementioned issues, in this work, we propose a Dual Mamba-enhanced Graph Convolutional Network (DMbaGCN), which is a novel framework that integrates Mamba into GNNs to address over-smoothing from both local and global perspectives. DMbaGCN consists of two modules: the Local State-Evolution Mamba (LSEMba) for local neighborhood aggregation and utilizing Mamba's selective state space modeling to capture node-specific representation dynamics across layers, and the Global Context-Aware Mamba (GCAMba) that leverages Mamba's global attention capabilities to incorporate global context for each node. By combining these components, DMbaGCN enhances node discriminability in deep GNNs, thereby mitigating over-smoothing. Extensive experiments on multiple benchmarks demonstrate the effectiveness and efficiency of our method.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper puts forward a dual-Mamba GNN that tries to keep node representations distinct in deep layers by adding selective local dynamics and global context, but the advantage over standard residuals still needs concrete demonstration.

The main takeaway is that this work adapts Mamba's selective state-space modeling into a GCN to tackle over-smoothing from two angles at once. LSEMba handles local neighborhood aggregation while tracking how each node's features evolve layer by layer, and GCAMba brings in global context for every node. The authors argue that residual connections and skips do not explicitly model this progressive, node-specific change, so the dual setup should preserve discriminability better in deeper stacks.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes DMbaGCN, a Dual Mamba-enhanced Graph Convolutional Network that integrates Local State-Evolution Mamba (LSEMba) for local neighborhood aggregation and node-specific representation dynamics via selective state space modeling, together with Global Context-Aware Mamba (GCAMba) for incorporating global context per node. The central claim is that this combination enhances node discriminability in deep GNNs and thereby mitigates over-smoothing, with effectiveness shown through experiments on multiple benchmarks.

Significance. If the empirical results and adaptation hold, the work offers a novel direction for applying selective SSMs to model progressive, node-specific evolution in GNN layers, potentially improving upon residual or skip-connection baselines for deeper architectures. The dual local-global design is a clear strength and could influence future SSM-graph hybrids, though its impact hinges on demonstrating concrete gains in discriminability metrics beyond existing techniques.

major comments (2)

Abstract: the claim that 'extensive experiments on multiple benchmarks demonstrate the effectiveness' supplies no quantitative results, baseline comparisons, over-smoothing metrics (e.g., MAD, Dirichlet energy), or ablation details, leaving the central empirical support for the dual-Mamba claim without visible grounding in the provided text.
Method (LSEMba description): the assertion that selective state-space modeling captures 'node-specific representation dynamics across layers' to maintain discriminability assumes the SSM selection mechanism can counteract homogenization from repeated neighborhood averaging. No derivation is supplied showing how the combined LSEMba+GCAMba modules alter the contraction rate of the layer operator; if the SSM is applied post-aggregation without explicit topology-aware discretization, the update reduces to a learned residual without guaranteed advantage over skip connections.
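The homogenization concern in the second comment can be made concrete with Dirichlet energy, one of the metrics named in the first comment. The sketch below is a generic illustration and not code from the paper: it uses the symmetric normalization with self-loops that is standard for GCNs and tracks the energy of features under repeated plain averaging, with no weights or nonlinearities. All function names are hypothetical.

```python
import numpy as np

def normalized_adjacency(adj):
    """Symmetric normalization with self-loops: D^{-1/2} (A + I) D^{-1/2}."""
    a_tilde = adj + np.eye(adj.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(a_tilde.sum(axis=1))
    return a_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def dirichlet_energy(x, a_hat):
    """Dirichlet energy tr(X^T (I - A_hat) X) of node features X."""
    laplacian = np.eye(a_hat.shape[0]) - a_hat
    return float(np.trace(x.T @ laplacian @ x))

def energy_under_propagation(adj, x, num_layers):
    """Energy after 0, 1, ..., num_layers rounds of X <- A_hat X."""
    a_hat = normalized_adjacency(adj)
    energies = [dirichlet_energy(x, a_hat)]
    for _ in range(num_layers):
        x = a_hat @ x                       # plain neighborhood averaging
        energies.append(dirichlet_energy(x, a_hat))
    return energies
```

Under this operator the energy cannot increase from one round to the next, because the eigenvalues of the normalized adjacency lie in (-1, 1]. A derivation of the kind the referee asks for would need to show how the LSEMba and GCAMba updates change that behavior.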

minor comments (2)

Notation for LSEMba and GCAMba should be accompanied by explicit equations or pseudocode to clarify how the selective SSM is discretized and applied to graph-structured inputs.
The introduction could more explicitly contrast the proposed node-specific progressive modeling with prior residual and attention-based over-smoothing remedies.
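For orientation on the first minor comment, the equations below give the zero-order-hold discretization used in the original Mamba formulation, written here with a layer index ℓ in place of a time step. Applying it along the layer axis is an assumption made for illustration; the page does not reproduce the paper's equations, and the projection matrices W_Δ, W_B, and W_C are hypothetical names.

```latex
\begin{aligned}
\Delta_\ell &= \operatorname{softplus}(W_\Delta x_\ell), \qquad
B_\ell = W_B x_\ell, \qquad
C_\ell = W_C x_\ell, \\
\bar{A}_\ell &= \exp(\Delta_\ell A), \\
\bar{B}_\ell &= (\Delta_\ell A)^{-1}\bigl(\exp(\Delta_\ell A) - I\bigr)\,\Delta_\ell B_\ell, \\
h_\ell &= \bar{A}_\ell\, h_{\ell-1} + \bar{B}_\ell\, x_\ell, \\
y_\ell &= C_\ell\, h_\ell .
\end{aligned}
```

The step size and the B and C terms depend on the input x_ℓ, which is what makes the model selective. The referee's question is how x_ℓ is built from graph-structured inputs and whether the discretization is aware of the topology.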

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address each major comment below and indicate the revisions we will make to strengthen the presentation and analysis.

Referee: Abstract: the claim that 'extensive experiments on multiple benchmarks demonstrate the effectiveness' supplies no quantitative results, baseline comparisons, over-smoothing metrics (e.g., MAD, Dirichlet energy), or ablation details, leaving the central empirical support for the dual-Mamba claim without visible grounding in the provided text.

Authors: We agree that the abstract would be strengthened by including concrete quantitative highlights. In the revised manuscript we will add specific performance gains (e.g., accuracy improvements on the cited benchmarks) together with references to the over-smoothing metrics (MAD, Dirichlet energy) and ablation results already reported in the experimental section.

Revision: yes

Referee: Method (LSEMba description): the assertion that selective state-space modeling captures 'node-specific representation dynamics across layers' to maintain discriminability assumes the SSM selection mechanism can counteract homogenization from repeated neighborhood averaging. No derivation is supplied showing how the combined LSEMba+GCAMba modules alter the contraction rate of the layer operator; if the SSM is applied post-aggregation without explicit topology-aware discretization, the update reduces to a learned residual without guaranteed advantage over skip connections.

Authors: We acknowledge that a formal derivation of the contraction-rate change would provide additional theoretical support. Our current argument rests on the empirical behavior of the selective SSM, which permits input-dependent state transitions that adapt per node and per layer; this is distinct from a fixed residual because the selection parameters are conditioned on the current node features and aggregated neighborhood. We will expand the method section with a clearer mechanistic explanation of the LSEMba–GCAMba interaction and will include additional diagnostic plots (e.g., layer-wise MAD curves) to illustrate the effect. A complete contraction-rate analysis remains an open theoretical question that we flag for future work.

Revision: partial
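The layer-wise MAD curves mentioned in this response could be computed along the following lines. This is a sketch under stated assumptions: it takes MAD to be the mean cosine distance over a masked set of node pairs, averaged first per node and then over nodes with at least one non-zero distance. The function names and the default all-pairs mask are illustrative choices, not the authors' code.

```python
import numpy as np

def mad(embeddings, mask=None, tol=1e-9):
    """Mean Average Distance over masked node pairs.

    embeddings: (N, D) node representations at one layer.
    mask:       (N, N) 0/1 array selecting pairs; default is all i != j.
    """
    n = embeddings.shape[0]
    if mask is None:
        mask = 1.0 - np.eye(n)
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.maximum(norms, 1e-12)
    dist = np.clip(1.0 - unit @ unit.T, 0.0, None) * mask   # cosine distance
    counts = np.maximum((dist > tol).sum(axis=1), 1)        # non-zero entries per row
    row_avg = dist.sum(axis=1) / counts
    valid_rows = max(int((row_avg > tol).sum()), 1)
    return float(row_avg.sum() / valid_rows)

def layerwise_mad(per_layer_embeddings, mask=None):
    """MAD at each layer, given a list of (N, D) arrays."""
    return [mad(h, mask) for h in per_layer_embeddings]
```

A curve that falls toward zero as depth increases indicates over-smoothing. Passing a neighbor-only or remote-only mask restricts the average to those pairs.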

Circularity Check

0 steps flagged

No circularity: architectural proposal with independent empirical validation

The paper proposes DMbaGCN as a new architectural combination of LSEMba (for local neighborhood aggregation and node-specific state evolution via selective SSM) and GCAMba (for global context). The central claim—that this mitigates over-smoothing by enhancing node discriminability—is presented as the outcome of the design and supported by benchmark experiments, without any equations, fitted parameters renamed as predictions, or self-citation chains that reduce the result to its own inputs. The derivation chain consists of module definitions and their integration, which remain independent of the performance assertions and do not invoke uniqueness theorems or ansatzes from prior self-work in a load-bearing manner. This is a standard non-circular empirical contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the unstated assumption that Mamba's selective state-space mechanism transfers effectively to graph-structured data for both local evolution tracking and global context injection; no explicit free parameters, axioms, or invented physical entities are described in the abstract.
