Bilevel Graph Structure Learning, Revisited: Inner-Channel Origins of the Reported Gain

Beakcheol Jang; Minkyoung Kim

arXiv: 2605.07577 · v1 · submitted 2026-05-08 · cs.LG


Pith reviewed 2026-05-11 02:15 UTC · model grok-4.3

classification: cs.LG

keywords: bilevel optimization, graph structure learning, graph neural networks, training dynamics, inner loop, rewiring, control experiment, spatio-temporal forecasting

The pith

A large share of bilevel graph structure learning gains arises from inner-loop training dynamics rather than graph rewiring.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Bilevel graph structure learning jointly optimizes a graph neural network and a learned adjacency matrix to improve performance on tasks such as node classification and flow forecasting. The paper demonstrates that much of the observed improvement traces to the repeated inner-loop updates of the model parameters, which carry implicit gradient regularization, rather than to the changes in edge connections alone. A frozen-phi control holds the graph fixed while preserving the original inner training schedule, cleanly separating the training-dynamics channel from the graph-rewiring channel. On spatio-temporal forecasting benchmarks the inner channel alone reproduces 78 to 101 percent of the full bilevel gain; on node classification it reproduces 37 to 44 percent. The authors also supply a three-precondition test that predicts the sign of the bilevel improvement across six datasets.

Core claim

The central claim is that training-dynamics effects in the inner loop, rather than the rewiring itself, capture a substantial share of the gain in bilevel graph structure learning. The frozen-phi control decomposes the bilevel gain into an inner channel of T-step training dynamics with implicit gradient regularization and a graph channel of the graph rewiring itself. On spatio-temporal flow forecasting the inner channel matches or exceeds the full bilevel pipeline, accounting for 78-101 percent of the gain; on node classification it accounts for 37-44 percent under a Bernoulli edge-level parameterization. Classical spectral diagnostics can dissociate from task gain, and a three-precondition framework predicts the sign of the bilevel gain on all six benchmarks.
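The decomposition can be made concrete with a small sketch. It follows the definitions in the Figure 1 caption (inner channel = vanilla minus frozen-phi, graph channel = frozen-phi minus bilevel, for a lower-is-better metric such as MAE). The function name, the `lower_is_better` flag, and the numbers are illustrative assumptions, not values or code from the paper.

```python
def decompose_gain(vanilla, frozen_phi, bilevel, lower_is_better=True):
    """Split the bilevel gain into an inner channel and a graph channel.

    All three arguments are test metrics for the same benchmark.
    For higher-is-better metrics (e.g. accuracy) the signs are flipped
    so that a positive delta always means an improvement.
    """
    sign = 1.0 if lower_is_better else -1.0
    delta_inner = sign * (vanilla - frozen_phi)   # training dynamics, graph fixed
    delta_graph = sign * (frozen_phi - bilevel)   # rewiring, inner schedule fixed
    delta_total = delta_inner + delta_graph       # equals sign * (vanilla - bilevel)
    if delta_total == 0:
        raise ValueError("no bilevel gain to decompose")
    return {
        "inner": delta_inner,
        "graph": delta_graph,
        "total": delta_total,
        "inner_share": delta_inner / delta_total,
    }


# Hypothetical MAE values chosen for illustration only.
result = decompose_gain(vanilla=21.0, frozen_phi=20.2, bilevel=20.0)
print(round(result["inner_share"], 3))
```

A share above 100 percent arises whenever the frozen-phi run beats the full bilevel run, because the graph channel is then negative.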

What carries the argument

The frozen-phi control, which freezes the graph structure while retaining the inner-loop training schedule of T steps, isolates training-dynamics effects from the contribution of graph rewiring.
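As a rough illustration of the control's structure, the toy below runs the same T-step inner loop in both modes and skips only the outer graph update when `frozen_phi=True`. Everything here is an assumption made for the sketch: the ring graph, the per-node edge scaling `phi`, the linear model, the full-batch updates, and the hand-derived first-order outer gradient. It is not the paper's algorithm and does not reproduce its reported gains.

```python
import numpy as np


def make_toy(seed=0, n=30, d=4):
    """Hypothetical data: a ring graph with self-loops and a linear target."""
    rng = np.random.default_rng(seed)
    A_base = np.zeros((n, n))
    for i in range(n):
        for j in (i - 1, i, i + 1):
            A_base[i, j % n] = 1.0 / 3.0
    X = rng.normal(size=(n, d))
    phi_true = rng.uniform(0.5, 1.5, size=n)
    theta_true = rng.normal(size=d)
    y = (A_base * phi_true) @ X @ theta_true + 0.01 * rng.normal(size=n)
    return A_base, X, y


def train(frozen_phi, T=5, outer_iters=50, lr_theta=0.1, lr_phi=0.05, seed=0):
    A_base, X, y = make_toy(seed)
    n, d = X.shape
    train_idx, val_idx = np.arange(0, 20), np.arange(20, n)
    theta, phi = np.zeros(d), np.ones(n)
    for _ in range(outer_iters):
        A = A_base * phi                       # column j is scaled by phi[j]
        Z = (A @ X)[train_idx]
        for _ in range(T):                     # inner loop: T steps on theta
            r = Z @ theta - y[train_idx]
            theta = theta - lr_theta * (2.0 / len(train_idx)) * (Z.T @ r)
        if not frozen_phi:                     # outer loop: one step on phi
            s = X @ theta
            r_val = (A @ s)[val_idx] - y[val_idx]
            grad_phi = (2.0 / len(val_idx)) * s * (A_base[val_idx].T @ r_val)
            phi = phi - lr_phi * grad_phi
    A = A_base * phi
    val_mse = float(np.mean(((A @ X @ theta)[val_idx] - y[val_idx]) ** 2))
    return theta, phi, val_mse


theta_f, phi_f, mse_f = train(frozen_phi=True)    # control: graph never changes
theta_b, phi_b, mse_b = train(frozen_phi=False)   # full toy bilevel run
print(mse_f, mse_b)
```

The point of the design is that the two runs share the seed, the data, the optimizer settings, and the inner schedule, so any difference between them comes from the outer update alone.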

If this is right

On spatio-temporal forecasting the inner channel alone matches or exceeds the full bilevel gain.
On node classification the inner channel accounts for 37-44 percent of the gain under Bernoulli parameterization.
Classical spectral diagnostics of the learned graph can separate from measured task performance.
A three-precondition test predicts the sign of bilevel gain across all six evaluated benchmarks.
Graph distillation is offered as a method-agnostic complement to the frozen-phi diagnostic.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If inner-loop dynamics are the dominant factor, simpler repeated-training procedures without outer-loop structure optimization may suffice for many tasks.
Routine use of frozen controls in bilevel papers would reduce misattribution of gains to structure changes.
The decomposition technique could be applied to other bilevel setups outside graphs to check for similar inner-channel dominance.
The precondition framework might serve as a quick filter to decide whether full bilevel training is worth the extra cost.

Load-bearing premise

Freezing the graph in the control experiment leaves the inner-loop optimization trajectory and its implicit regularization essentially as they are in the original bilevel procedure, so the control introduces no artifacts of its own.

What would settle it

If the frozen-phi control reproduces less than half the reported bilevel gain on the same six benchmarks under identical hyper-parameters, the claim that inner-channel dynamics dominate would not hold.

Figures

Figures reproduced from arXiv: 2605.07577 by Beakcheol Jang, Minkyoung Kim.

**Figure 1.** Frozen-ϕ isolates training dynamics from graph modification. (Left) The control disables the outer loop while retaining the T-step inner loop, decomposing the bilevel gain as $\Delta_{\mathrm{total}} = \Delta_{\mathrm{inner}} + \Delta_{\mathrm{graph}}$, where $\Delta_{\mathrm{inner}} = M_{\mathrm{Vanilla}} - M_{\mathrm{Frozen}\text{-}\phi}$ measures training dynamics with the graph held constant and $\Delta_{\mathrm{graph}} = M_{\mathrm{Frozen}\text{-}\phi} - M_{\mathrm{Bilevel}}$ measures graph modification with the inner schedule held constant. (Right) Flow for…

**Figure 2.** Two training regimes produce distinct mechanistic signatures. (a) On PeMS04, frozen-ϕ MAE decreases monotonically with T as $R_{\mathrm{IGR}} \propto T$ accumulates within reused mini-batches; at T = 1 frozen-ϕ collapses to vanilla. (b) On Cora and Citeseer, frozen-ϕ accuracy is T-invariant past inner convergence: full-batch gradients with per-outer-iteration parameter reset render the inner trajectory T-independent. Both sig…

**Figure 3.** Three-way decomposition under edge corruption on Cora. Inner channel saturates near…

**Figure 4.** PeMS04 (DiffConv, seed=42), epoch 0. Left: pre-clipping gradient norm during the inner loop, averaged across batches. The norm is approximately constant across all 10 inner steps, indicating that each step provides a parameter update of comparable magnitude rather than driving per-batch convergence. Right: per-batch training loss over inner steps for five representative batches; loss trajectories are appro…

**Figure 5.** Pre-clipping gradient norm across inner steps stratified by epoch (PeMS04, DiffConv,…

**Figure 6.** Training and validation MAE on PeMS04 (DiffConv, seed…

**Figure 7.** ST weight decay sweep (PeMS04, DiffConv). No weight decay setting reaches frozen-…

**Figure 8.** NC weight decay sweep (Cora, Citeseer). Weight decay closes part of the GCN-to-LDS…

**Figure 9.** Per-seed test MAE for vanilla, frozen-ϕ, and full bilevel on PeMS04, PeMS07, and PeMS08. Frozen-ϕ and vanilla seed distributions are clearly separated on PeMS07 and PeMS08; on PeMS04 they nearly separate (frozen-ϕ max 20.960 versus vanilla min 20.870, gap −0.09 MAE). The pattern supports the central decomposition claim that the inner channel provides a clean, seed-stable contribution distinct from per-seed…

**Figure 10.** Eigenvalue spectrum of the normalized Laplacian on the LCC subgraph (identical node set…

**Figure 11.** Bilevel-learned edge weight distributions versus the original distance-based Gaussian…

**Figure 12.** Input-output Jacobian norm stratified by graph distance for vanilla, frozen-…

**Figure 13.** NC Jacobian by graph distance. Top row: mean (matches Table 9 mean columns). Bottom…

**Figure 14.** Triangulated evidence on PeMS07. Three methodologically independent diagnostics,…

**Figure 15.** Decomposition of LDS gain into inner and graph channels on Cora and Citeseer. Cora:…

**Figure 16.** Distribution of learned Bernoulli edge probabilities…

**Figure 17.** AQ-437 adjacency reordered by city assignment (single-linkage at 80 km, 13 distinct…

**Figure 18.** Modification magnitude versus forecasting improvement across six ST datasets. Top:…

**Figure 19.** Decision flow for attributing gains in bilevel GSL. Diamonds: decision nodes; boxes:…

Original abstract

Bilevel graph structure learning is widely understood to improve graph neural networks by jointly optimizing model parameters and a learned graph structure, with the resulting performance gain attributed to the rewired adjacency. We find that this attribution may be overstated: training-dynamics effects in the inner loop, rather than the rewiring itself, capture a substantial share of the gain. To establish this, we introduce frozen-$\phi$, a control that freezes the graph while retaining the inner-loop training schedule. This decomposes the bilevel gain into an inner channel of $T$-step training dynamics with implicit gradient regularization and a graph channel of the graph rewiring itself. On spatio-temporal flow forecasting the inner channel matches or exceeds the full bilevel pipeline, accounting for 78-101% of the gain; on node classification it accounts for 37-44% under a Bernoulli edge-level parameterization. We also verify that classical spectral diagnostics can dissociate from task gain. We propose frozen-$\phi$ as a standardized diagnostic for bilevel graph structure learning, with graph distillation as a method-agnostic complement. A three-precondition framework further predicts the sign of the bilevel gain on all six benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper shows that inner-loop training dynamics, isolated via the frozen-phi control, explain a substantial share of the reported gains in bilevel graph structure learning: nearly all of the gain on spatio-temporal forecasting and 37-44 percent on node classification.

The paper's main finding is that training-dynamics effects inside the inner loop, captured by the authors' frozen-phi control, account for a large share of the gains usually credited to bilevel graph structure learning. On flow forecasting this inner channel covers 78-101 percent of the improvement, and on node classification 37-44 percent under a Bernoulli parameterization. The authors also show that classical spectral diagnostics can dissociate from actual task gains.

Referee Report

2 major / 2 minor

Summary. The manuscript re-examines bilevel graph structure learning (BGSL) for GNNs, arguing that reported performance gains are substantially attributable to training-dynamics effects in the inner loop (T-step optimization with implicit gradient regularization) rather than the graph rewiring itself. The authors introduce a 'frozen-φ' control that retains the inner-loop schedule but freezes the graph parameters, decomposing the bilevel gain into an 'inner channel' and a 'graph channel.' Experiments across spatio-temporal flow forecasting and node classification on six benchmarks show the inner channel accounting for 78-101% and 37-44% of the gains, respectively. They further demonstrate that classical spectral diagnostics can dissociate from task performance and propose a three-precondition framework to predict the sign of bilevel gains, recommending frozen-φ as a standardized diagnostic.

Significance. If the decomposition is valid, the work provides a useful corrective to common attributions in the BGSL literature and supplies a practical diagnostic tool plus a predictive framework that could guide future method design. The consistency of results across two task families and multiple benchmarks is a positive feature, as is the proposal of graph distillation as a complement. These elements could improve evaluation standards in graph structure learning if the control's isolation properties are confirmed.

major comments (2)

[§3.2] (Frozen-φ control definition): The central claim that the inner channel captures 78-101% (spatio-temporal) and 37-44% (node classification) of the bilevel gain rests on frozen-φ isolating training-dynamics effects without confounding changes. Freezing the graph removes the outer-loop dependence of model-parameter gradients on graph parameters, which may alter implicit regularization, curvature, or gradient flow relative to the coupled bilevel procedure. This risks mixing genuine inner-loop effects with control artifacts in the reported percentages; explicit checks (e.g., gradient-norm trajectories or effective regularization strength) comparing frozen-φ to the full bilevel run would be required to support the attribution.
[§4] (Experiments and Tables 2-3): The quantitative attributions are presented as consistent across benchmarks, yet the manuscript omits full hyperparameter schedules, exact data splits, random seeds, and implementation code. Without these, it is impossible to verify that the 78-101% and 37-44% figures are robust rather than sensitive to post-hoc choices, directly affecting confidence in the load-bearing empirical claims.

minor comments (2)

[§5] The three-precondition framework is introduced as predictive, but its derivation appears largely empirical; a brief discussion of how the preconditions were selected versus post-hoc fitting would clarify its generality.
Notation for 'inner channel' versus 'graph channel' is introduced in the abstract and §3 but would benefit from an explicit equation or diagram in the introduction to aid readers unfamiliar with the decomposition.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important aspects of the frozen-φ control and reproducibility that we address below. We provide clarifications on the design rationale and commit to specific revisions that strengthen the empirical support without altering the core claims.

Referee: [§3.2] (Frozen-φ control definition): The central claim that the inner channel captures 78-101% (spatio-temporal) and 37-44% (node classification) of the bilevel gain rests on frozen-φ isolating training-dynamics effects without confounding changes. Freezing the graph removes the outer-loop dependence of model-parameter gradients on graph parameters, which may alter implicit regularization, curvature, or gradient flow relative to the coupled bilevel procedure. This risks mixing genuine inner-loop effects with control artifacts in the reported percentages; explicit checks (e.g., gradient-norm trajectories or effective regularization strength) comparing frozen-φ to the full bilevel run would be required to support the attribution.

Authors: We agree that the outer-loop coupling could in principle influence gradient flow and regularization strength, and that explicit verification is valuable. The frozen-φ control is constructed to retain the exact inner-loop optimization schedule (T steps with the same optimizer and loss) while disabling graph updates, thereby removing only the graph-channel contribution. To directly address the concern, the revised manuscript will add gradient-norm trajectory plots and effective regularization diagnostics (e.g., via Hessian trace approximations or loss curvature measures) comparing frozen-φ runs to the full bilevel procedure on the same benchmarks. These will demonstrate that the inner-loop dynamics remain comparable, supporting the reported attribution percentages.

Revision: yes

Referee: [§4] (Experiments and Tables 2-3): The quantitative attributions are presented as consistent across benchmarks, yet the manuscript omits full hyperparameter schedules, exact data splits, random seeds, and implementation code. Without these, it is impossible to verify that the 78-101% and 37-44% figures are robust rather than sensitive to post-hoc choices, directly affecting confidence in the load-bearing empirical claims.

Authors: We concur that full reproducibility details are essential for confidence in the quantitative results. The current version summarizes the main settings in Section 4 and the appendix, but we acknowledge the need for exhaustive documentation. In the revised manuscript we will expand the supplementary material to include complete hyperparameter schedules for all methods and baselines, exact train/validation/test splits with indices, all random seeds, and the full implementation code (including the frozen-φ variant) released under an open-source license upon acceptance. This will enable direct replication and robustness checks of the inner-channel percentages.

Revision: yes
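One simple way to summarize the gradient-norm comparison discussed in the first exchange is a per-step relative gap between the two runs. The sketch below assumes the per-inner-step gradient norms have already been logged for a frozen-φ run and a full bilevel run; the function name and the example numbers are hypothetical, and the paper may use a different diagnostic.

```python
import numpy as np


def max_relative_gap(norms_frozen, norms_bilevel, eps=1e-12):
    """Largest per-step relative difference between two gradient-norm traces.

    Both inputs are sequences with one gradient norm per inner step,
    logged under the same seed and inner schedule.
    """
    a = np.asarray(norms_frozen, dtype=float)
    b = np.asarray(norms_bilevel, dtype=float)
    if a.shape != b.shape:
        raise ValueError("traces must cover the same inner steps")
    return float(np.max(np.abs(a - b) / (np.abs(b) + eps)))


# Hypothetical traces over 4 inner steps, for illustration only.
gap = max_relative_gap([1.00, 0.98, 0.97, 0.95], [1.00, 1.00, 0.96, 0.95])
print(gap)
```

A small gap would be consistent with comparable inner-loop dynamics in the two runs, although it would not by itself rule out differences in curvature or effective regularization.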

Circularity Check

0 steps flagged

No circularity: empirical controls and benchmark comparisons are self-contained

The paper's core contribution is an empirical decomposition of bilevel graph structure learning gains using the frozen-φ control to separate inner-loop training dynamics from graph rewiring effects, with reported attributions (78-101% on forecasting, 37-44% on classification) obtained via direct performance measurements against the full bilevel pipeline and baselines across six benchmarks. No mathematical derivation, first-principles result, or prediction reduces by construction to a fitted parameter, self-referential definition, or self-citation chain; the three-precondition framework is presented as an observational predictor of gain sign rather than a tautological renaming. Any prior self-citations on bilevel GSL methods are not load-bearing for the new attribution claims, which rest on independently verifiable experimental controls.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The work relies on standard assumptions from bilevel optimization and gradient-based training in machine learning; it introduces the frozen-phi control as a methodological device rather than new mathematical axioms or physical entities.

free parameters (1)

T (inner-loop steps)
The number of inner training steps is a hyperparameter; its value determines how much of the training dynamics the control captures.

axioms (1)

Domain assumption: inner-loop training dynamics supply implicit gradient regularization whose benefit is largely independent of simultaneous graph changes.
Invoked to explain why the frozen-graph schedule recovers most of the bilevel gain.
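The implicit gradient regularization behind this assumption can be checked numerically in one dimension. The paper's appendix attributes to Barrett and Dherin [4] the statement that one gradient descent step with step size η follows the exact flow of the modified loss $L + (\eta/4)\|\nabla L\|^2$ up to higher-order terms. The sketch below tests that on the quadratic $L(w) = \tfrac{1}{2} a w^2$, which is an assumed toy loss chosen because both flows have closed forms.

```python
import math


def gd_step(w, a, eta):
    """One gradient descent step on L(w) = 0.5 * a * w**2."""
    return w - eta * a * w


def flow(w, a, eta, modified):
    """Exact gradient flow for time eta.

    With modified=True the flow runs on L + (eta / 4) * (dL/dw)**2,
    whose gradient is (a + 0.5 * eta * a**2) * w.
    """
    rate = a + (0.5 * eta * a * a if modified else 0.0)
    return w * math.exp(-eta * rate)


w0, a, eta = 1.0, 1.0, 0.1
step = gd_step(w0, a, eta)
err_plain = abs(step - flow(w0, a, eta, modified=False))
err_modified = abs(step - flow(w0, a, eta, modified=True))
print(err_plain, err_modified)
```

On this example the flow of the modified loss tracks the discrete step more closely than the flow of the original loss, which is the sense in which each step carries an implicit gradient-norm penalty.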

invented entities (1)

frozen-phi control (no independent evidence)
purpose: Diagnostic that freezes the graph while preserving the inner-loop training schedule to isolate the dynamics channel
Methodological invention for attribution; no independent falsifiable prediction outside the diagnostic itself.

Reference graph

Works this paper leans on

46 extracted references · 46 canonical work pages

[1] Uri Alon and Eran Yahav. On the bottleneck of graph neural networks and its practical implications. In ICLR, 2021.
[2] Lei Bai, Lina Yao, Can Li, Xianzhi Wang, and Can Wang. Adaptive graph convolutional recurrent network for traffic forecasting. In NeurIPS, 2020.
[3] Pradeep Kumar Banerjee, Kedar Karhadkar, Yu Guang Wang, Uri Alon, and Guido Montúfar. Oversquashing in GNNs through the lens of information contraction and graph expansion. In Allerton, 2022.
[4] David G.T. Barrett and Benoit Dherin. Implicit gradient regularization. In ICLR, 2021.
[5] Mitchell Black, Zhengchao Wan, Amir Nayyeri, and Yusu Wang. Understanding oversquashing in GNNs through the lens of effective resistance. In ICML, 2023.
[6] Yu Chen, Lingfei Wu, and Mohammed J. Zaki. Iterative deep graph learning for graph neural networks: Better and robust node embeddings. In NeurIPS, 2020.
[7] Andrea Cini and Ivan Marisca. Torch spatiotemporal, 2022. Software library.
[8] Andrea Cini, Ivan Marisca, and Cesare Alippi. Filling the g_ap_s: Multivariate time series imputation by graph neural networks. In ICLR, 2022.
[9] Andrea Cini, Ivan Marisca, Daniele Zambon, and Cesare Alippi. Taming local effects in graph-based spatiotemporal forecasting. In NeurIPS, 2023.
[10] Andrea Cini, Daniele Zambon, and Cesare Alippi. Sparse graph learning from spatiotemporal time series. Journal of Machine Learning Research, 2023.
[11] Francesco Di Giovanni, Lorenzo Giusti, Federico Barbero, Giulia Luise, Pietro Liò, and Michael M. Bronstein. On over-squashing in message passing neural networks: The impact of width, depth, and topology. In ICML, 2023.
[12] Bahare Fatemi, Layla El Asri, and Seyed Mehran Kazemi. SLAPS: Self-supervision improves structure learning for graph neural networks. In NeurIPS, 2021.
[13] Luca Franceschi, Mathias Niepert, Massimiliano Pontil, and Xiao He. Learning discrete structures for graph neural networks. In ICML, 2019.
[14] Johannes Gasteiger, Stefan Weißenberger, and Stephan Günnemann. Diffusion improves graph learning. In NeurIPS, 2019.
[15] Minyang Hu, Hong Chang, Bingpeng Ma, and Shiguang Shan. Learning continuous graph structure with bilevel programming for graph neural networks. In IJCAI, 2022.
[16] Renhe Jiang, Zhaonan Wang, Jiawei Yong, Puneet Jeph, Quanjun Chen, Yasumasa Kobayashi, Xuan Song, Shintaro Fukushima, and Toyotaro Suzumura. Spatio-temporal meta-graph learning for traffic forecasting. In AAAI, 2023.
[17] Wei Jin, Yao Ma, Xiaorui Liu, Xianfeng Tang, Suhang Wang, and Jiliang Tang. Graph structure learning for robust graph neural networks. In KDD, 2020.
[18] Kedar Karhadkar, Pradeep Kumar Banerjee, and Guido Montúfar. FoSR: First-order spectral rewiring for addressing oversquashing in GNNs. In ICLR, 2023.
[19] Liam Li and Ameet Talwalkar. Random search and reproducibility for neural architecture search. In UAI, 2019.
[20] Yaguang Li, Rose Yu, Cyrus Shahabi, and Yan Liu. Diffusion convolutional recurrent neural network: Data-driven traffic forecasting. In ICLR, 2018.
[21] Zhixun Li, Liang Wang, Xin Sun, Yifan Luo, Yanqiao Zhu, Dingshuo Chen, Yingtao Luo, Xiangxin Zhou, Qiang Liu, Shu Wu, Jeffrey Xu Yu, and Liang Wang. GSLB: The graph structure learning benchmark. In NeurIPS Datasets and Benchmarks Track, 2023.
[22] M. J. Lighthill and G. B. Whitham. On kinematic waves. II. A theory of traffic flow on long crowded roads. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, 229(1178):317–345, 1955.
[23] Hangchen Liu, Zheng Dong, Renhe Jiang, Jiewen Deng, Jinliang Deng, Quanjun Chen, and Xuan Song. Spatio-temporal adaptive embedding makes vanilla transformer SOTA for traffic forecasting. In CIKM, 2023.
[24] Hanxiao Liu, Karen Simonyan, and Yiming Yang. DARTS: Differentiable architecture search. In ICLR, 2019.
[25] Yixin Liu, Yu Zheng, Daokun Zhang, Hongxu Chen, Hao Peng, and Shirui Pan. Towards unsupervised deep graph structure learning. In WWW, 2022.
[26] Alessandro Manenti, Daniele Zambon, and Cesare Alippi. Learning latent graph structures and their uncertainty. In ICML, 2025.
[27] Ivan Marisca, Jacob Bamberger, Cesare Alippi, and Michael M. Bronstein. Over-squashing in spatio-temporal graph neural networks. In NeurIPS, 2025.
[28] Khang Nguyen, Nong Minh Hieu, Vinh Duc Nguyen, Nhat Ho, Stanley Osher, and Tan Minh Nguyen. Revisiting over-smoothing and over-squashing using Ollivier-Ricci curvature. In ICML, 2023.
[29] Hongbin Pei, Bingzhe Wei, Kevin Chen-Chuan Chang, Yu Lei, and Bo Yang. Geom-GCN: Geometric graph convolutional networks. In ICLR, 2020.
[30] Oleg Platonov, Denis Kuznedelev, Michael Diskin, Artem Babenko, and Liudmila Prokhorenkova. A critical look at the evaluation of GNNs under heterophily: Are we really making progress? In ICLR, 2023.
[31] Aravind Rajeswaran, Chelsea Finn, Sham M. Kakade, and Sergey Levine. Meta-learning with implicit gradients. In NeurIPS, 2019.
[32] Chao Shang, Jie Chen, and Jinbo Bi. Discrete graph structure learning for forecasting multiple time series. In ICLR, 2021.
[33] Samuel L. Smith, Benoit Dherin, David G.T. Barrett, and Soham De. On the origin of implicit regularization in stochastic gradient descent. In ICLR, 2021.
[34] Jiabin Tang, Tang Qian, Shijing Liu, Shengdong Du, Jie Hu, and Tianrui Li. Spatio-temporal latent graph structure learning for traffic forecasting. In IJCNN, 2022.
[35] Jake Topping, Francesco Di Giovanni, Benjamin Paul Chamberlain, Xiaowen Dong, and Michael M. Bronstein. Understanding over-squashing and bottlenecks on graphs via curvature. In ICLR, 2022.
[36] Floriano Tori, Vincent Holst, and Vincent Ginis. The effectiveness of curvature-based rewiring and the role of hyperparameters in GNNs revisited. In ICLR, 2025.
[37] Paul Vicol, Jonathan P. Lorraine, Fabian Pedregosa, David Duvenaud, and Roger Grosse. On implicit bias in overparameterized bilevel optimization. In ICML, 2022.
[38] Ruijia Wang, Shuai Mou, Xiao Wang, Wanpeng Xiao, Qi Ju, Chuan Shi, and Xing Xie. Graph structure estimation neural networks. In WWW, 2021.
[39] Zonghan Wu, Shirui Pan, Guodong Long, Jing Jiang, and Chengqi Zhang. Graph WaveNet for deep spatial-temporal graph modeling. In IJCAI, 2019.
[40] Antoine Yang, Pedro M. Esperança, and Fabio M. Carlucci. NAS evaluation is frustratingly hard. In ICLR, 2020.
[41] Zhilin Yang, William W. Cohen, and Ruslan Salakhutdinov. Revisiting semi-supervised learning with graph embeddings. In ICML, 2016.
[42] Qi Zhang, Jianlong Chang, Gaofeng Meng, Shiming Xiang, and Chunhong Pan. Spatio-temporal graph structure learning for traffic forecasting. In AAAI, 2020.
[43] Yu Zheng, Xiuwen Yi, Ming Li, Ruiyuan Li, Zhangqing Shan, Eric Chang, and Tianrui Li. Forecasting fine-grained air quality based on big data. In KDD, 2015.
[44] Zhiyao Zhou, Sheng Zhou, Bochao Mao, Xuanyi Zhou, Jiawei Chen, Qiaoyu Tan, Daochen Zha, Yan Feng, Chun Chen, and Can Wang. OpenGSL: A comprehensive benchmark for graph structure learning. In NeurIPS Datasets and Benchmarks Track, 2023.

A Algorithm. Algorithm 1: First-Order Bilevel Rewiring for STGNNs. Require: Spatial graph $A$, STGNN $f_\theta$, policy $\pi_\phi$, warmup ep...
[45] and Smith et al. [33]). (A4) Per-step stochastic gradient statistics $\|g_t\|^2$ and $\mathrm{tr}(\Sigma_t)/B$ are bounded. Proof. Barrett and Dherin [4] establish via backward error analysis that a single gradient descent step with step size $\eta$ on $L_{\mathrm{train}}$ maps to the exact flow of a modified loss $\tilde{L}_{\mathrm{train}} = L_{\mathrm{train}} + (\eta/4)\|\nabla L_{\mathrm{train}}\|^2 + O(\eta^2)$. Smith et al. [33] extend this to SGD ...

[46] that the modified loss for stochastic gradient descent with per-batch gradients $\hat{g}$ and batch size $B$ is $\tilde{L}^{\mathrm{SGD}}_{\mathrm{train}} = L_{\mathrm{train}} + (\eta/4)\left(\|g\|^2 + \mathrm{tr}(\Sigma)/B\right) + O(\eta^2)$, where $g$ is the per-step mean of $\hat{g}$ and $\Sigma$ denotes the per-example gradient covariance (so $\hat{g}$ has covariance $\Sigma/B$). With $\phi$ held fixed within the inner loop, the per-step $R_{\mathrm{IGR}}$ is a functional of the inner-loo...

[1] [1]

On the bottleneck of graph neural networks and its practical implications

Uri Alon and Eran Yahav. On the bottleneck of graph neural networks and its practical implications. InICLR, 2021

work page 2021

[2] [2]

Adaptive graph convolutional recurrent network for traffic forecasting

Lei Bai, Lina Yao, Can Li, Xianzhi Wang, and Can Wang. Adaptive graph convolutional recurrent network for traffic forecasting. InNeurIPS, 2020

work page 2020

[3] [3]

Oversquashing in GNNs through the lens of information contraction and graph expansion

Pradeep Kumar Banerjee, Kedar Karhadkar, Yu Guang Wang, Uri Alon, and Guido Montúfar. Oversquashing in GNNs through the lens of information contraction and graph expansion. In Allerton, 2022

work page 2022

[4] [4]

Barrett and Benoit Dherin

David G.T. Barrett and Benoit Dherin. Implicit gradient regularization. InICLR, 2021

work page 2021

[5] [5]

Understanding oversquashing in GNNs through the lens of effective resistance

Mitchell Black, Zhengchao Wan, Amir Nayyeri, and Yusu Wang. Understanding oversquashing in GNNs through the lens of effective resistance. InICML, 2023

work page 2023

[6] [6]

Yu Chen, Lingfei Wu, and Mohammed J. Zaki. Iterative deep graph learning for graph neural networks: Better and robust node embeddings. InNeurIPS, 2020

work page 2020

[7] [7]

Torch spatiotemporal, 2022

Andrea Cini and Ivan Marisca. Torch spatiotemporal, 2022. Software library

work page 2022

[8] Andrea Cini, Ivan Marisca, and Cesare Alippi. Filling the g_ap_s: Multivariate time series imputation by graph neural networks. In ICLR, 2022.

[9] Andrea Cini, Ivan Marisca, Daniele Zambon, and Cesare Alippi. Taming local effects in graph-based spatiotemporal forecasting. In NeurIPS, 2023.

[10] Andrea Cini, Daniele Zambon, and Cesare Alippi. Sparse graph learning from spatiotemporal time series. Journal of Machine Learning Research, 2023.

[11] Francesco Di Giovanni, Lorenzo Giusti, Federico Barbero, Giulia Luise, Pietro Liò, and Michael M. Bronstein. On over-squashing in message passing neural networks: The impact of width, depth, and topology. In ICML, 2023.

[12] Bahare Fatemi, Layla El Asri, and Seyed Mehran Kazemi. SLAPS: Self-supervision improves structure learning for graph neural networks. In NeurIPS, 2021.

[13] Luca Franceschi, Mathias Niepert, Massimiliano Pontil, and Xiao He. Learning discrete structures for graph neural networks. In ICML, 2019.

[14] Johannes Gasteiger, Stefan Weißenberger, and Stephan Günnemann. Diffusion improves graph learning. In NeurIPS, 2019.

[15] Minyang Hu, Hong Chang, Bingpeng Ma, and Shiguang Shan. Learning continuous graph structure with bilevel programming for graph neural networks. In IJCAI, 2022.

[16] Renhe Jiang, Zhaonan Wang, Jiawei Yong, Puneet Jeph, Quanjun Chen, Yasumasa Kobayashi, Xuan Song, Shintaro Fukushima, and Toyotaro Suzumura. Spatio-temporal meta-graph learning for traffic forecasting. In AAAI, 2023.

[17] Wei Jin, Yao Ma, Xiaorui Liu, Xianfeng Tang, Suhang Wang, and Jiliang Tang. Graph structure learning for robust graph neural networks. In KDD, 2020.

[18] Kedar Karhadkar, Pradeep Kumar Banerjee, and Guido Montúfar. FoSR: First-order spectral rewiring for addressing oversquashing in GNNs. In ICLR, 2023.

[19] Liam Li and Ameet Talwalkar. Random search and reproducibility for neural architecture search. In UAI, 2019.

[20] Yaguang Li, Rose Yu, Cyrus Shahabi, and Yan Liu. Diffusion convolutional recurrent neural network: Data-driven traffic forecasting. In ICLR, 2018.

[21] Zhixun Li, Liang Wang, Xin Sun, Yifan Luo, Yanqiao Zhu, Dingshuo Chen, Yingtao Luo, Xiangxin Zhou, Qiang Liu, Shu Wu, Jeffrey Xu Yu, and Liang Wang. GSLB: The graph structure learning benchmark. In NeurIPS Datasets and Benchmarks Track, 2023.

[22] M. J. Lighthill and G. B. Whitham. On kinematic waves. II. A theory of traffic flow on long crowded roads. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, 229(1178):317–345, 1955.

[23] Hangchen Liu, Zheng Dong, Renhe Jiang, Jiewen Deng, Jinliang Deng, Quanjun Chen, and Xuan Song. Spatio-temporal adaptive embedding makes vanilla transformer SOTA for traffic forecasting. In CIKM, 2023.

[24] Hanxiao Liu, Karen Simonyan, and Yiming Yang. DARTS: Differentiable architecture search. In ICLR, 2019.

[25] Yixin Liu, Yu Zheng, Daokun Zhang, Hongxu Chen, Hao Peng, and Shirui Pan. Towards unsupervised deep graph structure learning. In WWW, 2022.

[26] Alessandro Manenti, Daniele Zambon, and Cesare Alippi. Learning latent graph structures and their uncertainty. In ICML, 2025.

[27] Ivan Marisca, Jacob Bamberger, Cesare Alippi, and Michael M. Bronstein. Over-squashing in spatio-temporal graph neural networks. In NeurIPS, 2025.

[28] Khang Nguyen, Nong Minh Hieu, Vinh Duc Nguyen, Nhat Ho, Stanley Osher, and Tan Minh Nguyen. Revisiting over-smoothing and over-squashing using Ollivier-Ricci curvature. In ICML, 2023.

[29] Hongbin Pei, Bingzhe Wei, Kevin Chen-Chuan Chang, Yu Lei, and Bo Yang. Geom-GCN: Geometric graph convolutional networks. In ICLR, 2020.

[30] Oleg Platonov, Denis Kuznedelev, Michael Diskin, Artem Babenko, and Liudmila Prokhorenkova. A critical look at the evaluation of GNNs under heterophily: Are we really making progress? In ICLR, 2023.

[31] Aravind Rajeswaran, Chelsea Finn, Sham M. Kakade, and Sergey Levine. Meta-learning with implicit gradients. In NeurIPS, 2019.

[32] Chao Shang, Jie Chen, and Jinbo Bi. Discrete graph structure learning for forecasting multiple time series. In ICLR, 2021.

[33] Samuel L. Smith, Benoit Dherin, David G.T. Barrett, and Soham De. On the origin of implicit regularization in stochastic gradient descent. In ICLR, 2021.

[34] Jiabin Tang, Tang Qian, Shijing Liu, Shengdong Du, Jie Hu, and Tianrui Li. Spatio-temporal latent graph structure learning for traffic forecasting. In IJCNN, 2022.

[35] Jake Topping, Francesco Di Giovanni, Benjamin Paul Chamberlain, Xiaowen Dong, and Michael M. Bronstein. Understanding over-squashing and bottlenecks on graphs via curvature. In ICLR, 2022.

[36] Floriano Tori, Vincent Holst, and Vincent Ginis. The effectiveness of curvature-based rewiring and the role of hyperparameters in GNNs revisited. In ICLR, 2025.

[37] Paul Vicol, Jonathan P. Lorraine, Fabian Pedregosa, David Duvenaud, and Roger Grosse. On implicit bias in overparameterized bilevel optimization. In ICML, 2022.

[38] Ruijia Wang, Shuai Mou, Xiao Wang, Wanpeng Xiao, Qi Ju, Chuan Shi, and Xing Xie. Graph structure estimation neural networks. In WWW, 2021.

[39] Zonghan Wu, Shirui Pan, Guodong Long, Jing Jiang, and Chengqi Zhang. Graph WaveNet for deep spatial-temporal graph modeling. In IJCAI, 2019.

[40] Antoine Yang, Pedro M. Esperança, and Fabio M. Carlucci. NAS evaluation is frustratingly hard. In ICLR, 2020.

[41] Zhilin Yang, William W. Cohen, and Ruslan Salakhutdinov. Revisiting semi-supervised learning with graph embeddings. In ICML, 2016.

[42] Qi Zhang, Jianlong Chang, Gaofeng Meng, Shiming Xiang, and Chunhong Pan. Spatio-temporal graph structure learning for traffic forecasting. In AAAI, 2020.

[43] Yu Zheng, Xiuwen Yi, Ming Li, Ruiyuan Li, Zhangqing Shan, Eric Chang, and Tianrui Li. Forecasting fine-grained air quality based on big data. In KDD, 2015.

[44] Zhiyao Zhou, Sheng Zhou, Bochao Mao, Xuanyi Zhou, Jiawei Chen, Qiaoyu Tan, Daochen Zha, Yan Feng, Chun Chen, and Can Wang. OpenGSL: A comprehensive benchmark for graph structure learning. In NeurIPS Datasets and Benchmarks Track, 2023.

A Algorithm

Algorithm 1 First-Order Bilevel Rewiring for STGNNs
Require: Spatial graph $A$, STGNN $f_\theta$, policy $\pi_\phi$, warmup ep...

and Smith et al. [33]). (A4) Per-step stochastic gradient statistics $\|g_t\|^2$ and $\operatorname{tr}(\Sigma_t)/B$ are bounded.

Proof. Barrett and Dherin [4] establish via backward error analysis that a single gradient descent step with step size $\eta$ on $L_{\text{train}}$ maps to the exact flow of a modified loss $\tilde{L}_{\text{train}} = L_{\text{train}} + (\eta/4)\,\|\nabla L_{\text{train}}\|^2 + O(\eta^2)$. Smith et al. [33] extend this to SGD ...
