arxiv: 2606.01660 · v1 · pith:BILF5XXXnew · submitted 2026-06-01 · 💻 cs.LG

Gate the Filter, Not the Message: Node-Channel Mixtures for Pre-Propagation GNNs

Pith reviewed 2026-06-28 15:34 UTC · model grok-4.3

classification 💻 cs.LG
keywords pre-propagation GNNs · graph filters · mixture of experts · Chebyshev polynomials · node-channel adaptation · scalable graph learning · heterophilic graphs

The pith

A 3D-gated mixture of Chebyshev filter experts enables joint node- and channel-adaptive filtering in pre-propagation GNNs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that PPGNN performance differences arise mainly from how filter coefficients are shared across nodes versus channels, not from raw aggregator power. MLP designs learn channel-dependent filters shared across nodes, while hop-attention designs learn node-dependent mixtures shared across channels. This leaves a gap for joint adaptation that respects the pre-propagation contract. FilterMoE fills the gap by routing a small bank of learnable Chebyshev experts through a 3D gating tensor, producing consistent gains without dataset-specific aggregator tuning.

Core claim

Existing PPGNNs differ primarily in filter coefficient sharing patterns: channel-dependent but node-shared for MLPs, node-dependent but channel-shared for hop-attention. The missing joint node-channel regime can be realized by routing a bank of learnable Chebyshev filters through a 3D gating tensor, producing a single model that outperforms prior PPGNNs on nine of eleven benchmarks and leads on all large-scale ones with a 1.53-point average gain.
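
The three sharing patterns can be made concrete with a toy calculation. The sketch below assumes a simplified view in which hop features are stored as `hops[k][n][c]` (hop, node, channel) and the output is a per-entry weighted sum over hops. The names `combine`, `w`, `a`, and `theta` are hypothetical, and reducing MLP-style and hop-attention-style aggregators to plain coefficient tables is an illustrative assumption, not the paper's formulation.

```python
def combine(hops, coeff):
    """Y[n][c] = sum_k coeff(n, k, c) * hops[k][n][c]."""
    K, N, C = len(hops), len(hops[0]), len(hops[0][0])
    return [[sum(coeff(n, k, c) * hops[k][n][c] for k in range(K))
             for c in range(C)] for n in range(N)]

# Toy hop features with K=2 hops, N=2 nodes, C=2 channels (made-up values).
hops = [
    [[1.0, 2.0], [3.0, 4.0]],  # hop 0
    [[0.5, 1.0], [1.5, 2.0]],  # hop 1
]

# Channel-dependent, node-shared coefficients w[k][c] (MLP-like, assumed).
w = [[1.0, 0.0], [0.0, 1.0]]
# Node-dependent, channel-shared coefficients a[n][k] (hop-attention-like, assumed).
a = [[1.0, 0.0], [0.0, 1.0]]
# Joint coefficients theta[n][k][c]: free to vary over nodes and channels.
theta = [
    [[1.0, 0.0], [0.0, 1.0]],  # node 0
    [[0.0, 1.0], [1.0, 0.0]],  # node 1
]

channel_only = combine(hops, lambda n, k, c: w[k][c])
node_only = combine(hops, lambda n, k, c: a[n][k])
joint = combine(hops, lambda n, k, c: theta[n][k][c])
```

In this toy setting `channel_only` picks the same hop for a given channel at every node, `node_only` picks the same hop for every channel at a given node, and `joint` can pick a different hop for each (node, channel) pair.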

What carries the argument

A 3D gating tensor that routes a small bank of learnable Chebyshev filter experts jointly over nodes and feature channels.
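
A minimal sketch of how such a gate could combine experts follows. It assumes a gate of shape (nodes, channels, experts) that is softmax-normalized over experts, and experts that are each a vector of coefficients over precomputed hop features. The function names, the tensor layout, and the softmax normalization are assumptions made for illustration; this summary does not specify how the paper's router produces the gate.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def gated_filter_mixture(hops, alpha, gate_logits):
    """hops[k][n][c], alpha[e][k], gate_logits[n][c][e] -> Y[n][c].

    Each expert e applies its own coefficients alpha[e] over the hops;
    the gate mixes expert outputs separately for every (node, channel).
    """
    K, N, C = len(hops), len(hops[0]), len(hops[0][0])
    E = len(alpha)
    out = [[0.0] * C for _ in range(N)]
    for n in range(N):
        for c in range(C):
            gate = softmax(gate_logits[n][c])
            for e in range(E):
                expert_out = sum(alpha[e][k] * hops[k][n][c] for k in range(K))
                out[n][c] += gate[e] * expert_out
    return out

# Toy inputs (made-up values): 2 hops, 2 nodes, 2 channels, 2 experts.
hops = [
    [[1.0, 2.0], [3.0, 4.0]],  # hop 0
    [[0.5, 1.0], [1.5, 2.0]],  # hop 1
]
alpha = [[1.0, 0.0], [0.0, 1.0]]  # expert 0 keeps hop 0, expert 1 keeps hop 1
gate_logits = [
    [[0.0, 0.0], [20.0, -20.0]],  # node 0: even mix on ch 0, expert 0 on ch 1
    [[-20.0, 20.0], [0.0, 0.0]],  # node 1: expert 1 on ch 0, even mix on ch 1
]
y = gated_filter_mixture(hops, alpha, gate_logits)
```

Because the gate has one entry per (node, channel, expert), two channels of the same node, or two nodes on the same channel, can end up with different effective filters while the hop features themselves stay precomputed.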

If this is right

  • Outperforms strong PPGNN baselines on nine of eleven homophilic and heterophilic benchmarks.
  • Ranks first on all three large-scale benchmarks tested.
  • Delivers a 1.53-point average test-score improvement over prior designs.
  • Supplies a single architecture that serves as a robust alternative to dataset-specific hop-aggregator selection.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The routing mechanism could be applied to other polynomial bases such as Bernstein or monomial filters while preserving the pre-propagation property.
  • Joint node-channel adaptation may reduce the engineering effort spent on per-dataset hyperparameter searches over aggregator families.
  • The same 3D gating idea might transfer to other scalable GNN families that separate propagation from learned parameters.

Load-bearing premise

Performance gaps among existing PPGNNs are explained mainly by differences in how filter coefficients are shared across nodes and channels rather than by differences in raw aggregator capacity.

What would settle it

A controlled test showing that a PPGNN variant with higher raw aggregator capacity but the same node-channel sharing pattern as an MLP or hop-attention model consistently beats FilterMoE on the large-scale benchmarks would falsify the central claim.

Figures

Figures reproduced from arXiv: 2606.01660 by Zhiru Zhang, Zichao Yue.

Figure 1: ∆ test score (%) of HOGA vs. SIGN on eight graphs. HOGA loses to SIGN on heterophilic graphs while outperforming SIGN on homophilic ones. Setting HOGA heads = hidden_channels (triangles) does not close the gap consistently. A useful way to explain this behavior is through the lens of graph filtering.
Figure 2: Architecture overview of FilterMoE.
Mixture-of-experts on graphs. Mixture-of-experts (MoE) models provide conditional computation through input-dependent routing [Jacobs et al., 1991]. Sparse MoE layers scale this idea by activating only a subset of experts per input, as in sparsely gated MoE, GShard, and Switch Transformer [Shazeer et al., 2017, Lepikhin et al., 2021, Fedus et al., 2022]. Graph learning ha…
Figure 3: Runtime–accuracy trade-off on pokec. Runtime includes training and evaluation, but excludes one-time preprocessing, and is averaged over 10 runs; …
Original abstract

Pre-propagation graph neural networks (PPGNNs) push all graph-dependent computation into a preprocessing step and train only on the resulting dense hop features, which makes them highly scalable. A puzzle in this regime is that more complex hop aggregators do not reliably outperform simpler ones: on many benchmarks, a plain MLP-based aggregator matches or beats hop-attention variants. We revisit this behavior from a graph-filter perspective. Over a precomputed diffusion basis, existing PPGNNs differ mainly in how filter coefficients are shared across nodes and feature channels, rather than simply in raw aggregator capacity. MLP-based architectures learn channel-dependent filters that are largely shared across nodes, while hop-attention-based architectures learn node-dependent mixtures that are largely shared across channels. This reveals a missing regime in standard PPGNN designs: joint node- and channel-adaptive filtering under the pre-propagation computational contract. We propose FilterMoE, a mixture-of-experts PPGNN in which a small bank of learnable Chebyshev filter experts is routed jointly over nodes and channels by a 3D gating tensor. Across eleven homophilic and heterophilic benchmarks, FilterMoE outperforms strong PPGNN baselines on nine datasets and ranks first on all three large-scale benchmarks, improving the average test score by 1.53 points. These results establish joint node-channel filter routing as a robust alternative to dataset-specific hop-aggregator selection.
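
The pre-propagation contract described in the abstract can be sketched on a toy graph. The example assumes the diffusion basis is built with the standard Chebyshev recurrence T_0 = X, T_1 = L̂X, T_k = 2 L̂ T_{k-1} − T_{k-2} on a rescaled Laplacian L̂ = L − I (taking the largest eigenvalue of the normalized Laplacian to be 2). The helper names are hypothetical and the exact operator used in the paper is not stated in this summary.

```python
def matmul(A, B):
    """Dense matrix product for nested lists (fine for a toy graph)."""
    return [[sum(A[i][j] * B[j][c] for j in range(len(B)))
             for c in range(len(B[0]))] for i in range(len(A))]

def chebyshev_hops(L_hat, X, degree):
    """Return [T_0(L_hat) X, ..., T_degree(L_hat) X] via the three-term recurrence."""
    hops = [X]
    if degree >= 1:
        hops.append(matmul(L_hat, X))
    for _ in range(2, degree + 1):
        LT = matmul(L_hat, hops[-1])
        hops.append([[2 * LT[i][c] - hops[-2][i][c] for c in range(len(X[0]))]
                     for i in range(len(X))])
    return hops

# Toy graph: two nodes joined by one edge. The normalized adjacency equals A,
# so L = I - A and the assumed rescaling gives L_hat = L - I = -A.
L_hat = [[0.0, -1.0], [-1.0, 0.0]]
X = [[1.0], [0.0]]  # one feature channel, made-up values

# Preprocessing: all graph-dependent work happens once, here.
hops = chebyshev_hops(L_hat, X, degree=2)
# Training would then consume `hops` as dense features without touching the graph.
```

After this step the graph is no longer needed: any learned combination of the cached `hops` is a polynomial filter of degree at most 2 in this example.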

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper proposes FilterMoE, a pre-propagation GNN architecture that routes a small bank of learnable Chebyshev filter experts via a 3D gating tensor to realize joint node- and channel-adaptive filtering. It argues from a graph-filter perspective that prior PPGNNs differ mainly in how they share filter coefficients across nodes versus channels (MLP-style channel-dependent but node-shared; hop-attention-style node-dependent but channel-shared), identifies the joint regime as missing, and reports that FilterMoE outperforms strong PPGNN baselines on nine of eleven homophilic and heterophilic benchmarks while ranking first on all three large-scale ones with a 1.53-point average test-score gain.

Significance. If the performance advantage can be isolated to the joint node-channel routing mechanism, the work would supply a scalable, dataset-agnostic PPGNN design that replaces per-dataset aggregator selection with a single flexible filter-mixture regime. The multi-benchmark evaluation spanning homophilic and heterophilic graphs, together with emphasis on the three largest datasets, constitutes a concrete empirical contribution; the graph-filter unification of existing PPGNN designs is a useful organizing lens.

major comments (1)
  1. [Abstract and experimental evaluation] The claim that PPGNN performance gaps are explained primarily by differences in node/channel filter sharing (rather than raw aggregator capacity) is load-bearing for motivating the 3D-gated MoE, yet the reported comparisons supply neither parameter counts nor FLOPs for the baselines, nor capacity-matched ablations that hold total expert parameters or routing overhead fixed. Consequently, the 1.53-point average improvement and first-place large-scale results do not yet isolate the joint-sharing regime from the simple effect of increased effective capacity introduced by multiple learnable experts plus the 3D gating tensor.
minor comments (2)
  1. [Abstract] The abstract and results tables do not report per-run variance, standard errors, or statistical significance tests for the claimed improvements.
  2. [Method] Notation for the 3D gating tensor and its routing over nodes, channels, and experts should be introduced with an explicit equation or diagram in the method section to clarify the pre-propagation contract.
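
The per-run statistics requested in minor comment 1 could be reported along the following lines. This is a minimal sketch using only the Python standard library; the function name `summarize` is hypothetical and the scores are placeholder values, not results from the paper.

```python
import math
import statistics

def summarize(scores):
    """Return (mean, sample standard deviation, standard error of the mean)."""
    mean = statistics.mean(scores)
    std = statistics.stdev(scores)        # sample std, divides by n - 1
    sem = std / math.sqrt(len(scores))    # standard error of the mean
    return mean, std, sem

# Placeholder test scores from three hypothetical seeds (NOT from the paper).
scores = [70.0, 72.0, 74.0]
mean, std, sem = summarize(scores)
print(f"{mean:.2f} ± {std:.2f} (SEM {sem:.2f}, n={len(scores)})")
```

Reporting the standard error alongside the standard deviation makes it easier to judge whether a gap between two methods exceeds run-to-run noise.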

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on strengthening the experimental controls. We address the major comment below.

Point-by-point responses
  1. Referee: [Abstract and experimental evaluation] The claim that PPGNN performance gaps are explained primarily by differences in node/channel filter sharing (rather than raw aggregator capacity) is load-bearing for motivating the 3D-gated MoE, yet the reported comparisons supply neither parameter counts nor FLOPs for the baselines, nor capacity-matched ablations that hold total expert parameters or routing overhead fixed. Consequently, the 1.53-point average improvement and first-place large-scale results do not yet isolate the joint-sharing regime from the simple effect of increased effective capacity introduced by multiple learnable experts plus the 3D gating tensor.

    Authors: We agree that parameter counts and FLOPs are necessary to better isolate the contribution of joint node-channel routing. In the revised manuscript we will add a table reporting parameter counts and estimated FLOPs for FilterMoE and all baselines. We will also include a capacity-controlled ablation that scales a single-expert Chebyshev baseline to match the total parameter budget of the mixture model, allowing direct comparison of the routing mechanism versus raw capacity. These additions will address the concern without altering the core claims.

    Revision: yes
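
One way to set up the capacity-matched comparison mentioned in the response is to pick the baseline's width so that its parameter count stays within the mixture model's budget. The sketch below assumes the baseline's learnable part is a two-layer MLP with biases; the function names, the layer layout, and the budget value are hypothetical, since the summary does not say how the authors count parameters.

```python
def mlp_params(d_in, hidden, d_out):
    """Parameters of a two-layer MLP with biases: d_in -> hidden -> d_out."""
    return d_in * hidden + hidden + hidden * d_out + d_out

def match_width(budget, d_in, d_out):
    """Largest hidden width whose parameter count does not exceed the budget."""
    # mlp_params is linear in hidden: hidden * (d_in + 1 + d_out) + d_out
    return max((budget - d_out) // (d_in + 1 + d_out), 0)

# Hypothetical numbers: a mixture model with 120 parameters in total.
budget = 120
hidden = match_width(budget, d_in=8, d_out=2)
baseline_params = mlp_params(8, hidden, 2)
```

With the widths matched this way, a remaining accuracy gap is harder to attribute to parameter count alone.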

Circularity Check

0 steps flagged

No significant circularity in derivation chain

Full rationale

The paper is an architectural proposal (FilterMoE with 3D-gated Chebyshev experts) plus empirical validation on eleven benchmarks. The motivating graph-filter perspective on node vs. channel sharing in PPGNNs is presented as an interpretive lens rather than a derived equation. No load-bearing mathematical steps, fitted parameters renamed as predictions, or self-citation chains appear in the provided abstract or description. Performance gains are reported as experimental outcomes, not quantities forced by construction from the inputs. The contribution remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that precomputed diffusion bases suffice for the filter learning task and on the modeling choice that a small fixed bank of Chebyshev experts plus a learned 3D gate can capture the desired adaptation; no new physical entities are postulated.

free parameters (2)
  • number of filter experts
    Hyperparameter controlling the size of the expert bank; its value is not derived from first principles.
  • Chebyshev polynomial degree
    Hyperparameter setting the order of the polynomial approximation for each expert filter.
axioms (1)
  • Domain assumption: precomputed diffusion bases capture sufficient graph structure for downstream filter learning.
    Invoked when the paper states that all graph-dependent computation is pushed into preprocessing.
invented entities (1)
  • 3D gating tensor (no independent evidence)
    Purpose: routes filter experts jointly over nodes and channels.
    New architectural component introduced to realize the joint-adaptation regime.



Reference graph

Works this paper leans on

26 extracted references · 5 canonical work pages · 1 internal anchor

  1. [1]

    Mixture of experts meets decoupled message passing: Towards general and adaptive node classification. CoRR, abs/2412.08193,

    Xuanze Chen, Jiajun Zhou, Shanqing Yu, and Qi Xuan. Mixture of experts meets decoupled message passing: Towards general and adaptive node classification. CoRR, abs/2412.08193,

  2. [3]

    Fabrizio Frasca, Emanuele Rossi, Davide Eynard, Ben Chamberlain, Michael Bronstein, and Federico Monti

    URL https://jmlr.org/papers/v23/21-0998.html. Fabrizio Frasca, Emanuele Rossi, Davide Eynard, Ben Chamberlain, Michael Bronstein, and Federico Monti. SIGN: Scalable Inception Graph Neural Networks. arXiv preprint arXiv:2004.11198,

  3. [4]

    Fernando Gama, Brendon G. Anderson, and Somayeh Sojoudi

    Fernando Gama, Brendon G. Anderson, and Somayeh Sojoudi. Node-variant graph filters in graph neural networks. arXiv preprint arXiv:2106.00089,

  4. [5]

    URL https://arxiv.org/abs/2106.00089

    doi: 10.48550/arXiv.2106.00089. URL https://arxiv.org/abs/2106.00089. Johannes Gasteiger, Aleksandar Bojchevski, and Stephan Günnemann. Predict Then Propagate: Graph Neural Networks Meet Personalized PageRank. International Conference on Learning Representations (ICLR),

  5. [6]

    Justin Gilmer, Samuel S. Schoenholz, Patrick F. Riley, Oriol Vinyals, and George E. Dahl

    Justin Gilmer, Samuel S. Schoenholz, Patrick F. Riley, Oriol Vinyals, and George E. Dahl. Neural message passing for quantum chemistry. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, Proceedings of Machine Learning Research, pages 1263–1272. PMLR,

  6. [7]

    Jingwei Guo, Kaizhu Huang, Xinping Yi, and Rui Zhang

    URL http://proceedings.mlr.press/v70/gilmer17a.html. Jingwei Guo, Kaizhu Huang, Xinping Yi, and Rui Zhang. Graph neural networks with diverse spectral filtering. In Proceedings of the ACM web conference 2023, pages 306–316,

  7. [8]

    Node-wise filtering in graph neural networks: A mixture of experts approach. CoRR, abs/2406.03464,

    Haoyu Han, Juanhui Li, Wei Huang, Xianfeng Tang, Hanqing Lu, Chen Luo, Hui Liu, and Jiliang Tang. Node-wise filtering in graph neural networks: A mixture of experts approach. CoRR, abs/2406.03464,

  8. [9]

    Keiran Paster, Marco Dos Santos, Zhangir Azerbayev, and Jimmy Ba

    doi: 10.48550/ARXIV.2406.03464. URL https://doi.org/10.48550/arXiv.2406.03464. Mingguo He, Zhewei Wei, and Ji-Rong Wen. Convolutional neural networks on graphs with chebyshev approximation, revisited. In Sanmi Koyejo, S. Mohamed, A. Agarwal, Danielle Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems 35: Annual Con...

  9. [10]

    Graphdive: Graph classification by mixture of diverse experts

    Fenyu Hu, Liping Wang, Qiang Liu, Shu Wu, Liang Wang, and Tieniu Tan. Graphdive: Graph classification by mixture of diverse experts. In Luc De Raedt, editor, Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI 2022, Vienna, Austria, 23-29 July 2022, pages 2080–2086. ijcai.org,

  10. [11]

    Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec

    doi: 10.24963/IJCAI.2022/289. Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. Open Graph Benchmark: Datasets for Machine Learning on Graphs. Conf. on Neural Information Processing Systems (NeurIPS),

  11. [12]

    Neural Computation 3, 79–87

    doi: 10.1162/NECO.1991.3.1.79. URL https://doi.org/10.1162/neco.1991.3.1.79. Thomas N Kipf and Max Welling. Semi-Supervised Classification with Graph Convolutional Networks. International Conference on Learning Representations (ICLR),

  12. [13]

    Sitao Luan, Chenqing Hua, Qincheng Lu, Jiaqi Zhu, Mingde Zhao, Shuyuan Zhang, Xiao-Wen Chang, and Doina Precup

    URL https://arxiv.org/abs/2006.16668. Sitao Luan, Chenqing Hua, Qincheng Lu, Jiaqi Zhu, Mingde Zhao, Shuyuan Zhang, Xiao-Wen Chang, and Doina Precup. Revisiting heterophily for graph neural networks. Advances in neural information processing systems, 35:1362–1375,

  13. [14]

    A critical look at the evaluation of gnns under heterophily: Are we really making progress? arXiv preprint arXiv:2302.11640,

    Oleg Platonov, Denis Kuznedelev, Michael Diskin, Artem Babenko, and Liudmila Prokhorenkova. A critical look at the evaluation of gnns under heterophily: Are we really making progress? arXiv preprint arXiv:2302.11640,

  14. [15]

    Recipe for a general, powerful, scalable graph transformer

    Ladislav Rampásek, Michael Galkin, Vijay Prakash Dwivedi, Anh Tuan Luu, Guy Wolf, and Dominique Beaini. Recipe for a general, powerful, scalable graph transformer. In Sanmi Koyejo, S. Mohamed, A. Agarwal, Danielle Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processi...

  15. [16]

    Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc V

    URL http://papers.nips.cc/paper_files/paper/2022/hash/5d4834a159f1547b267a05a4e2b7cf5e-Abstract-Conference.html. Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc V. Le, Geoffrey E. Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. In 5th International Conference on Learning Represe...

  16. [17]

    net/forum?id=B1ckMDqlg

    URL https://openreview.net/forum?id=B1ckMDqlg. Oleksandr Shchur, Maximilian Mumme, Aleksandar Bojchevski, and Stephan Günnemann. Pitfalls of graph neural network evaluation. arXiv preprint arXiv:1811.05868,

  17. [18]

    Adaptive graph diffusion networks. arXiv preprint arXiv:2012.15024,

    Chuxiong Sun, Jie Hu, Hongming Gu, Jinpeng Chen, and Mingchuan Yang. Adaptive graph diffusion networks. arXiv preprint arXiv:2012.15024,

  18. [19]

    Fast estimation of tr(𝑓(𝑎)) via stochastic Lanczos quadrature

    doi: 10.1137/16M1104974. Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. Graph Attention Networks. International Conference on Learning Representations (ICLR),

  19. [20]

    Graph mixture of experts: Learning on large-scale graphs with explicit diversity modeling

    Haotao Wang, Ziyu Jiang, Yuning You, Yan Han, Gaowen Liu, Jayanth Srinivasa, Ramana Kompella, and Zhangyang Wang. Graph mixture of experts: Learning on large-scale graphs with explicit diversity modeling. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors, Advances in Neural Information Processing Systems 36...

  20. [21]

    Xiyuan Wang and Muhan Zhang

    URL http://papers.nips.cc/paper_files/paper/2023/hash/9f4064d145bad5e361206c3303bda7b8-Abstract-Conference.html. Xiyuan Wang and Muhan Zhang. How powerful are spectral graph neural networks. In International conference on machine learning, pages 23341–23362. PMLR,

  21. [22]

    Mixture of weak and strong experts on graphs

    Hanqing Zeng, Hanjia Lyu, Diyi Hu, Yinglong Xia, and Jiebo Luo. Mixture of weak and strong experts on graphs. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11,

  22. [23]

    St-moe: Designing stable and transferable sparse expert models. arXiv preprint arXiv:2202.08906,

    Barret Zoph, Irwan Bello, Sameer Kumar, Nan Du, Yanping Huang, Jeff Dean, Noam Shazeer, and William Fedus. St-moe: Designing stable and transferable sparse expert models. arXiv preprint arXiv:2202.08906,

  23. [24]

    It is intended for readers less familiar with pre-propagation GNNs (PP-GNNs). The key distinction is where graph-dependent computation occurs: message-passing GNNs (MP-GNNs) repeatedly aggregate over the graph during training, whereas PP-GNNs amortize graph propagation into a preprocessing stage and train a dense predictor over cached graph-diffused featu...

  24. [25]

    fixed” denotes the standard public split; “random

    Split protocol. We follow the standard split protocol for each dataset. For the four homophily datasets amazon-computer, amazon-photo, coauthor-cs, and coauthor-physics, we report mean ± standard deviation over random splits. All other datasets use their fixed public splits. C Hardware settings: For the training efficiency study, we use a Linux server with a ...

  25. [26]

    Shared spectral preprocessing. All datasets use the same SLQ setup for the spectral response sketches in Sec

    Each configuration is selected by Optuna TPE and then evaluated over 10 random seeds. Shared spectral preprocessing. All datasets use the same SLQ setup for the spectral response sketches in Sec. 3.2: 20 random-vector probes, 50 Lanczos iterations per probe, and a $P = 64$ point weighted spectral grid $\{(\theta_p, w_p)\}_{p=1}^{64}$ on the rescaled graph Laplacian. This gri...

  26. [27]

    “–” means no projection. router is either the Direct joint MLP router or the Response-aware two-stage router from Sec. 3.3. k is the sparse-routing top-k value, with “dense

    For each tuned baseline, we run an Optuna TPE search on the validation split, select the best validation configuration, and report the mean and standard deviation over 10 random seeds. Unless otherwise specified, each search uses 50 trials per dataset. The same protocol is used for the Chebyshev-operator variants of SIGN, HOGA, and GAMLP in Appendix G: th...