pith. machine review for the scientific record.

arxiv: 2604.17324 · v1 · submitted 2026-04-19 · 💻 cs.LG · cs.AI

Recognition: unknown

SigGate-GT: Taming Over-Smoothing in Graph Transformers via Sigmoid-Gated Attention

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 06:24 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords: graph transformers · over-smoothing · sigmoid gating · attention mechanisms · molecular property prediction · GraphGPS

The pith

Learned sigmoid gates on attention outputs let graph transformers selectively suppress uninformative connections and reduce over-smoothing.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Graph transformers on molecular tasks lose performance with depth because node features become indistinguishable, a problem tied to softmax attention always forcing every node to attend to at least one other node. The paper shows that this same sum-to-one rule also drives attention entropy to collapse. SigGate-GT adds a learned sigmoid gate per attention head inside the GraphGPS backbone; the gate multiplies the attended features and can drive them to zero when the connection carries no signal. Experiments on five benchmarks report that this change cuts over-smoothing, raises entropy, and produces new or matched state-of-the-art numbers with roughly 1 percent extra parameters.

Core claim

Applying per-head learned sigmoid gates to the output of softmax attention inside graph transformers breaks the forced sum-to-one normalization, letting individual heads drive uninformative activations to zero and thereby slowing the progressive collapse of node representations with increasing depth.

What carries the argument

Per-head learned sigmoid gates multiplied element-wise onto the attention output, which selectively scale attended features toward zero without altering the softmax normalization itself.
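
To make the load-bearing mechanism concrete, here is a minimal PyTorch sketch. The placement of the gate (computed from the layer input H, applied element-wise before the output projection) follows the paper's Figure 1 description; the class name, the single linear gate projection, and the single-graph tensor shapes are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class SigmoidGatedAttention(nn.Module):
    """Multi-head self-attention with a learned, per-head, element-wise
    sigmoid gate on the attention output (sketch of the SigGate idea)."""

    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads, self.head_dim = num_heads, dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.gate = nn.Linear(dim, dim)   # gate computed from the layer input H
        self.out = nn.Linear(dim, dim)

    def gate_values(self, h: torch.Tensor) -> torch.Tensor:
        # Learned gate in [0, 1]; values near 0 silence a head's contribution.
        return torch.sigmoid(self.gate(h))

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (num_nodes, dim); one graph, global attention as in GraphGPS.
        n, d = h.shape
        q, k, v = self.qkv(h).chunk(3, dim=-1)

        def split(t: torch.Tensor) -> torch.Tensor:
            return t.reshape(n, self.num_heads, self.head_dim).transpose(0, 1)

        q, k, v = split(q), split(k), split(v)             # (heads, n, head_dim)
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.head_dim ** 0.5, dim=-1)
        ctx = attn @ v          # softmax output: every row still sums to one

        g = split(self.gate_values(h))                     # element-wise, per head
        ctx = (g * ctx).transpose(0, 1).reshape(n, d)      # gate can drive a head toward zero
        return self.out(ctx)
```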

If this is right

  • Over-smoothing, measured by mean absolute difference (MAD) between node representations, drops by about 30 percent across 4-to-16-layer models (a minimal sketch of this diagnostic and the entropy measure follows this list).
  • Attention entropy rises and training remains stable across a tenfold range of learning rates.
  • New state-of-the-art ROC-AUC of 82.47 percent on ogbg-molhiv and matched best MAE of 0.059 on ZINC.
  • Statistically significant gains over the GraphGPS baseline on all five evaluated datasets.
  • Overhead remains near 1 percent parameters while delivering these changes.
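
For readers who want to check these two diagnostics on their own runs, a minimal sketch follows; the paper's exact MAD and entropy definitions (normalization, which layers are averaged) may differ, so treat this as the common textbook form rather than the authors' measurement code.

```python
import torch

def mean_absolute_distance(h: torch.Tensor) -> torch.Tensor:
    """Mean absolute difference (MAD) between node representations.

    h: (num_nodes, dim) node features at some layer. Lower MAD means the
    representations have collapsed toward each other (more over-smoothing).
    Plain pairwise-mean form; the paper's variant may normalize differently.
    """
    diff = h.unsqueeze(0) - h.unsqueeze(1)   # (n, n, dim) pairwise differences
    return diff.abs().mean()

def mean_attention_entropy(attn: torch.Tensor) -> torch.Tensor:
    """Average Shannon entropy of attention rows.

    attn: (heads, n, n) softmax weights, each row summing to one. Entropy
    near zero means a row has collapsed onto a single node, the degenerate
    regime the paper associates with over-smoothing.
    """
    eps = 1e-12
    row_entropy = -(attn * (attn + eps).log()).sum(dim=-1)   # (heads, n)
    return row_entropy.mean()
```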

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same per-head zeroing mechanism could be tested on non-graph transformers to reduce attention sinks.
  • Pairing the gates with residual or normalization-based anti-smoothing methods might compound the depth gains.
  • The observed stability over wide learning-rate ranges suggests the gates could reduce the cost of hyper-parameter search on new graph tasks.

Load-bearing premise

Performance gains arise specifically because the gates suppress only uninformative signals rather than from incidental effects of extra parameters or different hyper-parameter choices.

What would settle it

Re-train the identical architecture with the learned sigmoid gates replaced by fixed gates of value 1.0 and check whether the measured reduction in over-smoothing and the accuracy gains both disappear.
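
A minimal sketch of that control, built on the hypothetical `SigmoidGatedAttention` module sketched earlier (the subclassing hook is an editorial construction, not the authors' code):

```python
import torch

class FixedGateAttention(SigmoidGatedAttention):
    """Control for the check above: identical architecture and parameters,
    but the gate is pinned at 1.0, so no selective suppression can occur.
    If the MAD drop and accuracy gains persist with this variant, they are
    not attributable to the learned gating."""

    def gate_values(self, h: torch.Tensor) -> torch.Tensor:
        return torch.ones_like(h)   # fixed gate of value 1.0 everywhere
```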

Figures

Figures reproduced from arXiv: 2604.17324 by Dongxin Guo, Jikun Wu, Siu Ming Yiu.

Figure 1
Figure 1: Architecture of a single SigGate-GT layer. The sigmoid gate (orange, σ-Gate) modulates the attention output element-wise before the output projection. The dashed arrow indicates the gate's input-dependent computation from H.
Figure 2
Figure 2: Learning rate sensitivity on ZINC. SigGate-GT maintains stable performance across a 10× range of learning rates, whereas GraphGPS degrades sharply above lr = 10⁻³.

Table shown alongside the figure (last SigGate value truncated in the source):

Method (GraphGPS + ...)     ZINC MAE ↓      Pep-struct MAE ↓
Baseline (no modification)  0.070 ± 0.004   0.2500 ± 0.0012
+ DropEdge [22]             0.067 ± 0.003   0.2488 ± 0.0015
+ PairNorm [23]             0.068 ± 0.004   0.2491 ± 0.0011
+ SigGate (ours)            0.059 ± 0.002   0.2431 ± 0.00…
Original abstract

Graph transformers achieve strong results on molecular and long-range reasoning tasks, yet remain hampered by over-smoothing (the progressive collapse of node representations with depth) and attention entropy degeneration. We observe that these pathologies share a root cause with attention sinks in large language models: softmax attention's sum-to-one constraint forces every node to attend somewhere, even when no informative signal exists. Motivated by recent findings that element-wise sigmoid gating eliminates attention sinks in large language models, we propose SigGate-GT, a graph transformer that applies learned, per-head sigmoid gates to the attention output within the GraphGPS framework. Each gate can suppress activations toward zero, enabling heads to selectively silence uninformative connections. On five standard benchmarks, SigGate-GT matches the prior best on ZINC (0.059 MAE) and sets new state-of-the-art on ogbg-molhiv (82.47% ROC-AUC), with statistically significant gains over GraphGPS across all five datasets ($p < 0.05$). Ablations show that gating reduces over-smoothing by 30% (mean relative MAD gain across 4-16 layers), increases attention entropy, and stabilizes training across a $10\times$ learning rate range, with about 1% parameter overhead on OGB.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces SigGate-GT, a graph transformer built on the GraphGPS framework that applies learned per-head sigmoid gates to attention outputs. This modification is motivated by the observation that softmax attention's sum-to-one constraint contributes to over-smoothing and entropy degeneration (analogous to attention sinks in LLMs). The sigmoid gates allow selective suppression of uninformative signals. On five benchmarks the model matches the prior best MAE of 0.059 on ZINC and achieves new SOTA ROC-AUC of 82.47% on ogbg-molhiv, with statistically significant gains over GraphGPS (p < 0.05) on all datasets. Ablations report a ~30% reduction in over-smoothing (via mean relative MAD across 4-16 layers), increased attention entropy, and training stability over a 10x learning-rate range, at ~1% parameter overhead.

Significance. If the reported gains are causally attributable to the gating mechanism, the work offers a lightweight, architecture-compatible intervention for a well-known limitation of graph transformers. The multi-dataset evaluation, statistical testing, and direct measurement of over-smoothing via MAD provide concrete supporting evidence. The approach could influence subsequent graph transformer designs, particularly in molecular and long-range reasoning tasks, provided the selectivity claim is isolated from capacity or optimization effects.

major comments (2)
  1. [Ablations section] Ablations section: The reported 30% MAD reduction and p<0.05 gains are attributed to selective suppression by the per-head sigmoid gates, yet no control experiment is described that holds total parameter count and architecture fixed while removing selectivity (e.g., replacing the sigmoid gates with learnable scalar multipliers per head or with fixed non-zero gates). Without such an isolation, the central causal claim remains vulnerable to the alternative explanation that gains arise from added capacity or altered training dynamics alone.
  2. [Results section] Results and experimental details: Concrete numbers and p<0.05 significance are stated, but the manuscript provides insufficient protocol information (exact data splits, baseline re-implementations, number of independent runs per result, variance estimates, and multiple-comparison correction) to allow independent verification or to rule out post-hoc selection. This weakens confidence in the SOTA claims on ogbg-molhiv and the cross-dataset significance.
minor comments (1)
  1. [Abstract and experimental setup] The abstract states 'about 1% parameter overhead on OGB', but a table or paragraph giving exact parameter counts for SigGate-GT versus GraphGPS (broken down by component) would strengthen the low-overhead claim. A sketch of how such a breakdown could be produced follows this report.
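
As one way to produce the per-component breakdown asked for here, a short sketch, illustrated on the hypothetical `SigmoidGatedAttention` module from earlier; the real figures would have to come from the authors' GraphGPS configuration.

```python
import torch.nn as nn

def param_breakdown(model: nn.Module) -> dict:
    """Parameter counts grouped by top-level submodule, the kind of table the
    low-overhead claim would benefit from (sketch, not the authors' script)."""
    counts = {name: sum(p.numel() for p in m.parameters())
              for name, m in model.named_children()}
    counts["total"] = sum(p.numel() for p in model.parameters())
    return counts

# e.g. param_breakdown(SigmoidGatedAttention(dim=64, num_heads=4))
# -> {'qkv': 12480, 'gate': 4160, 'out': 4160, 'total': 20800}
# Note: the ~1% figure in the abstract is relative to the full GraphGPS model
# (MPNN branch, FFN, embeddings included), not to this attention block alone.
```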

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript accordingly to strengthen the claims and reproducibility.

Point-by-point responses
  1. Referee: [Ablations section] Ablations section: The reported 30% MAD reduction and p<0.05 gains are attributed to selective suppression by the per-head sigmoid gates, yet no control experiment is described that holds total parameter count and architecture fixed while removing selectivity (e.g., replacing the sigmoid gates with learnable scalar multipliers per head or with fixed non-zero gates). Without such an isolation, the central causal claim remains vulnerable to the alternative explanation that gains arise from added capacity or altered training dynamics alone.

    Authors: We agree that a control isolating the selectivity of the sigmoid (its [0,1] bounding for suppression) from added capacity is valuable. The current ablations compare against the unmodified GraphGPS baseline (fewer parameters) and show consistent gains in MAD reduction and entropy. To address the concern directly, we will add a new ablation in the revised manuscript replacing the per-head sigmoid gates with learnable scalar multipliers per head (unbounded, same parameter count); a minimal sketch of such a control appears after these responses. This will demonstrate whether the bounded, suppressive behavior of the sigmoid is necessary for the observed over-smoothing mitigation, holding architecture and capacity fixed. revision: yes

  2. Referee: [Results section] Results and experimental details: Concrete numbers and p<0.05 significance are stated, but the manuscript provides insufficient protocol information (exact data splits, baseline re-implementations, number of independent runs per result, variance estimates, and multiple-comparison correction) to allow independent verification or to rule out post-hoc selection. This weakens confidence in the SOTA claims on ogbg-molhiv and the cross-dataset significance.

    Authors: We acknowledge that additional protocol details are required for full reproducibility and to support the statistical claims. In the revised manuscript, we will expand the experimental details to specify: the exact data splits and preprocessing steps (following official OGB and ZINC splits); that baselines were re-implemented in the same GraphGPS framework with reported hyperparameters; the number of independent runs (10 runs per model with different random seeds); variance estimates (standard deviations reported alongside means); and the statistical procedure (paired t-tests with p < 0.05, including any multiple-comparison correction such as Bonferroni). We will also release code, configurations, and seeds to enable independent verification. revision: yes
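
As an editorial illustration of the control discussed in response 1, here is a minimal sketch of a scalar-per-head variant built on the hypothetical `SigmoidGatedAttention` module sketched earlier; the class name, the broadcasting scheme, and the decision to leave the unused gate projection in place are assumptions, not the authors' planned implementation.

```python
import torch
import torch.nn as nn

class ScalarGateAttention(SigmoidGatedAttention):
    """Rebuttal control: one unbounded learnable scalar per head instead of
    the input-dependent, element-wise sigmoid gate. Extra capacity is kept,
    but the bounded, selective suppression is removed."""

    def __init__(self, dim: int, num_heads: int):
        super().__init__(dim, num_heads)
        self.head_scale = nn.Parameter(torch.ones(num_heads))

    def gate_values(self, h: torch.Tensor) -> torch.Tensor:
        # Broadcast one scalar per head over every node and channel.
        g = self.head_scale.repeat_interleave(self.head_dim)       # (dim,)
        return g.unsqueeze(0).expand(h.shape[0], -1).contiguous()  # (n, dim)
```

And for the statistical protocol described in response 2, a hedged sketch of a paired t-test over per-seed scores with a Bonferroni-adjusted threshold; the scores below are illustrative placeholders, not values from the paper, and the authors' exact procedure may differ.

```python
import numpy as np
from scipy import stats

def paired_significance(baseline: np.ndarray, ours: np.ndarray,
                        n_datasets: int = 5, alpha: float = 0.05):
    """Paired t-test over per-seed scores for one dataset, with a Bonferroni-
    adjusted threshold when several datasets are tested (protocol sketch)."""
    _, p_value = stats.ttest_rel(ours, baseline)
    return p_value, bool(p_value < alpha / n_datasets)

# Illustrative placeholder ROC-AUC scores for 10 seeds (not from the paper):
baseline_auc = np.array([81.2, 80.9, 81.5, 81.0, 81.3, 80.8, 81.1, 81.4, 80.7, 81.2])
gated_auc    = np.array([82.5, 82.1, 82.6, 82.3, 82.4, 82.0, 82.7, 82.2, 82.5, 82.4])
print(paired_significance(baseline_auc, gated_auc))
```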

Circularity Check

0 steps flagged

No circularity: empirical architecture with independent benchmark results

full rationale

The paper proposes SigGate-GT as a sigmoid-gated attention modification inside the GraphGPS framework and validates it via standard benchmark experiments (ZINC, ogbg-molhiv, etc.) plus ablations on MAD, entropy, and stability. No derivation chain, equations, or uniqueness theorems are invoked that reduce any claimed result to a fitted parameter or self-citation by construction. The reported metrics and statistical gains are externally falsifiable on public datasets and do not rely on self-referential definitions or renamed inputs.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 1 invented entities

The claim rests on the empirical effectiveness of the proposed gating mechanism. The only non-standard element is the introduction of the sigmoid gate itself; all other components are drawn from the existing GraphGPS framework and standard training practices.

free parameters (1)
  • per-head sigmoid gate weights
    Learned parameters introduced by the new gating layer; their values are determined by gradient descent on the training data rather than chosen by hand.
axioms (1)
  • domain assumption: Softmax attention's sum-to-one constraint forces every node to attend to something even when no informative signal exists, contributing to over-smoothing.
    Stated as the root cause observation motivating the work.
invented entities (1)
  • Sigmoid gate applied to attention output (no independent evidence)
    purpose: To allow selective suppression of uninformative connections and thereby reduce over-smoothing.
    New architectural component proposed in the paper; no independent evidence outside the reported experiments is provided.

pith-pipeline@v0.9.0 · 5534 in / 1475 out tokens · 44214 ms · 2026-05-10T06:24:04.979837+00:00 · methodology


Reference graph

Works this paper leans on

37 extracted references · 7 canonical work pages · 2 internal anchors

  1. [1]

    Neural Message Passing for Quantum Chemistry

    J. Gilmer et al. “Neural Message Passing for Quantum Chemistry”. In: Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017. Ed. by Doina Precup and Y. W. Teh. Vol. 70. Proceedings of Machine Learning Research. PMLR, 2017, pp. 1263–1272

  2. [2]

    Semi-Supervised Classification with Graph Convolutional Networks

    T. N. Kipf and M. Welling. “Semi-Supervised Classification with Graph Convolutional Networks”. In: 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net, 2017

  3. [3]

    How Powerful are Graph Neural Networks?

    K. Xu et al. “How Powerful are Graph Neural Networks?” In: 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019

  4. [4]

    On the Bottleneck of Graph Neural Networks and its Practical Implications

    U. Alon and E. Yahav. “On the Bottleneck of Graph Neural Networks and its Practical Implications”. In:9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021

  5. [5]

    Attention is all you need

    A. Vaswani et al. “Attention is all you need”. In:Proceedings of the 31st International Conference on Neural Information Processing Systems. NIPS’17. Long Beach, California, USA: Curran Associates Inc., 2017, 6000–6010.isbn: 9781510860964.url:https://proceedings.neurips.cc/paper_files/ paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf

  6. [6]

    Do Transformers Really Perform Badly for Graph Representation?

    C. Ying et al. “Do Transformers Really Perform Badly for Graph Representation?” In:Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual. Ed. by M. Ranzato et al. 2021, pp. 28877– 28888

  7. [7]

    Recipe for a General, Powerful, Scalable Graph Transformer

    L. Rampásek et al. “Recipe for a General, Powerful, Scalable Graph Transformer”. In:Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022. Ed. by S. Koyejo et al. 2022

  8. [8]

    Graph Inductive Biases in Transformers without Message Passing

    L. Ma et al. “Graph Inductive Biases in Transformers without Message Passing”. In:International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA. Ed. by A. Krause et al. Vol. 202. Proceedings of Machine Learning Research. PMLR, 2023, pp. 23321–23337

  9. [9]

    Deeper Insights Into Graph Convolutional Networks for Semi-Supervised Learning

    Q. Li, Z. Han, and X. Wu. “Deeper Insights Into Graph Convolutional Networks for Semi-Supervised Learning”. In:Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI- 18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Sympo- sium on Educational Advances in Artificial Intelligence (EAAI-...

  10. [10]

    Graph Neural Networks Exponentially Lose Expressive Power for Node Classification

    K. Oono and T. Suzuki. “Graph Neural Networks Exponentially Lose Expressive Power for Node Classification”. In:8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net, 2020

  11. [11]

    Graph Convolutions Enrich the Self-Attention in Transformers!

    J. Choi et al. “Graph Convolutions Enrich the Self-Attention in Transformers!” In:Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 - 15, 2024. Ed. by A. Globersons et al. 2024

  12. [12]

    A Survey on Oversmoothing in Graph Neural Networks

    T. K. Rusch, M. M. Bronstein, and S. Mishra. “A Survey on Oversmoothing in Graph Neural Networks”. In: arXiv preprint arXiv:2303.10993 (2023)

  13. [13]

    Stabilizing Transformer Training by Preventing Attention Entropy Collapse

    S. Zhai et al. “Stabilizing Transformer Training by Preventing Attention Entropy Collapse”. In:Inter- national Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA. Ed. by A. Krause et al. Vol. 202. Proceedings of Machine Learning Research. PMLR, 2023, pp. 40770– 40803

  14. [14]

    Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free

    Z. Qiu et al. “Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free”. In: Advances in Neural Information Processing Systems 38 (NeurIPS 2025). Oral; Best Paper Award. 2025

  15. [15]

    Efficient Streaming Language Models with Attention Sinks

    G. Xiao et al. “Efficient Streaming Language Models with Attention Sinks”. In:The Twelfth Inter- national Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024

  16. [16]

    GLU Variants Improve Transformer

    N. Shazeer. “GLU Variants Improve Transformer”. In:arXiv preprintarXiv.2002.05202 (2020)

  17. [17]

    Residual Gated Graph ConvNets

    X. Bresson and T. Laurent. “Residual Gated Graph ConvNets”. In: arXiv preprint arXiv:1711.07553 (2017)

  18. [18]

    Gated Graph Sequence Neural Networks

    Y. Li et al. “Gated Graph Sequence Neural Networks”. In:4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings. Ed. by Y. Bengio and Y. LeCun. 2016

  19. [19]

    Rethinking Graph Transformers with Spectral Attention

    D. Kreuzer et al. “Rethinking Graph Transformers with Spectral Attention”. In:Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual. Ed. by M. Ranzato et al. 2021, pp. 21618–21629

  20. [20]

    Exphormer: Sparse Transformers for Graphs

    H. Shirzad et al. “Exphormer: Sparse Transformers for Graphs”. In:arXiv preprintarXiv.2303.06147 (2023)

  21. [21]

    On the Connection Between MPNN and Graph Transformer

    C. Cai et al. “On the Connection Between MPNN and Graph Transformer”. In:International Confer- ence on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA. Ed. by A. Krause et al. Vol. 202. Proceedings of Machine Learning Research. PMLR, 2023, pp. 3408–3430

  22. [22]

    DropEdge: Towards Deep Graph Convolutional Networks on Node Classification

    Y. Rong et al. “DropEdge: Towards Deep Graph Convolutional Networks on Node Classification”. In: 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net, 2020

  23. [23]

    PairNorm: Tackling Oversmoothing in GNNs

    L. Zhao and L. Akoglu. “PairNorm: Tackling Oversmoothing in GNNs”. In:8th International Confer- ence on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenRe- view.net, 2020

  24. [24]

    DeeperGCN: All You Need to Train Deeper GCNs

    G. Li et al. “DeeperGCN: All You Need to Train Deeper GCNs”. In:arXiv preprintarXiv.2006.07739 (2020)

  25. [25]

    Theory, Analysis, and Best Practices for Sigmoid Self-Attention

    J. Ramapuram et al. “Theory, Analysis, and Best Practices for Sigmoid Self-Attention”. In: arXiv preprint arXiv:2409.04431 (2024)

  26. [26]

    Layer Normalization

    J. L. Ba, J. R. Kiros, and G. E. Hinton. “Layer Normalization”. In:arXiv preprintarXiv.1607.06450 (2016)

  27. [27]

    Benchmarking graph neural networks

    V. P. Dwivedi et al. “Benchmarking graph neural networks”. In:J. Mach. Learn. Res.24.1 (Jan. 2023). issn: 1532-4435

  28. [28]

    Open Graph Benchmark: Datasets for Machine Learning on Graphs

    W. Hu et al. “Open Graph Benchmark: Datasets for Machine Learning on Graphs”. In:Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual. Ed. by H. Larochelle et al. 2020

  29. [29]

    Long Range Graph Benchmark

    V. P. Dwivedi et al. “Long Range Graph Benchmark”. In:Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022. Ed. by S. Koyejo et al. 2022

  30. [30]

    Sign and Basis Invariant Networks for Spectral Graph Representation Learning

    D. Lim et al. “Sign and Basis Invariant Networks for Spectral Graph Representation Learning”. In: The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023

  31. [31]

    Graph Neural Networks with Learnable Structural and Positional Represen- tations

    V. P. Dwivedi et al. “Graph Neural Networks with Learnable Structural and Positional Represen- tations”. In:The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022

  32. [32]

    Decoupled Weight Decay Regularization

    I. Loshchilov and F. Hutter. “Decoupled Weight Decay Regularization”. In:7th International Con- ference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenRe- view.net, 2019

  33. [33]

    Principal Neighbourhood Aggregation for Graph Nets

    G. Corso et al. “Principal Neighbourhood Aggregation for Graph Nets”. In:Advances in Neural Infor- mation Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual. Ed. by H. Larochelle et al. 2020

  34. [34]

    Where Did the Gap Go? Reassessing the Long-Range Graph Benchmark

    J. Tönshoff et al. “Where Did the Gap Go? Reassessing the Long-Range Graph Benchmark”. In: Trans. Mach. Learn. Res.2024 (2024)

  35. [35]

    Measuring and Relieving the Over-Smoothing Problem for Graph Neural Networks from the Topological View

    D. Chen et al. “Measuring and Relieving the Over-Smoothing Problem for Graph Neural Networks from the Topological View”. In:The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intel...

  36. [36] internal anchor

    Excerpt (truncated in extraction): “… show the widest spread and highest fraction of near-zero gates, consistent with aggressive filtering of uninformative node pairs at the representation-building stage; the final layers (8–…”

  37. [37] internal anchor

    This pattern is consistent with functional specialization: the network learns where in the depth hierarchy to invest its filtering capacity

    Excerpt with per-layer gate statistics (last row truncated in the source): “… have intermediate behaviour. This pattern is consistent with functional specialization: the network learns where in the depth hierarchy to invest its filtering capacity.”

    Layer   Mean   Std    % < 0.1   % > 0.9
    1       0.71   0.13   2.4%      4.1%
    2       0.66   0.16   5.3%      5.8%
    3       0.62   0.18   8.1%      7.2%
    4       0.54   0.22   15.0%     9.4%
    5       0.51   0.23   17.8%     10.1%
    6       0.53   0.22   16.1%     9.8%
    7       0.55   0.21   14.2%     9.6%
    8       0.58   0.1…