Bilevel Graph Structure Learning, Revisited: Inner-Channel Origins of the Reported Gain
Pith reviewed 2026-05-11 02:15 UTC · model grok-4.3
The pith
A large share of bilevel graph structure learning gains arises from inner-loop training dynamics rather than graph rewiring.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that training-dynamics effects in the inner loop, rather than the rewiring itself, capture a substantial share of the gain in bilevel graph structure learning. The frozen-phi control decomposes the bilevel gain into an inner channel of T-step training dynamics with implicit gradient regularization and a graph channel of the graph rewiring itself. On spatio-temporal flow forecasting the inner channel matches or exceeds the full bilevel pipeline, accounting for 78-101 percent of the gain; on node classification it accounts for 37-44 percent under a Bernoulli edge-level parameterization. Classical spectral diagnostics can dissociate from task gain, and a three-precondition framework predicts the sign of the bilevel gain on all six benchmarks.
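The arithmetic behind these percentages can be sketched as follows. The source does not state the exact formula, so this assumes the inner-channel share is the frozen-phi improvement over a baseline trained without the bilevel schedule, divided by the full bilevel improvement, for an error metric where lower is better. The function name and all numbers are hypothetical.

```python
def inner_channel_share(baseline_err, frozen_err, bilevel_err):
    """Fraction of the bilevel gain reproduced by the frozen-phi control.

    Assumes lower error is better and that gains are measured against a
    baseline without the bilevel schedule (an assumption of this sketch;
    the source does not give the formula).
    """
    bilevel_gain = baseline_err - bilevel_err
    if bilevel_gain == 0:
        raise ValueError("bilevel gain is zero; share is undefined")
    inner_gain = baseline_err - frozen_err
    return inner_gain / bilevel_gain


# Hypothetical errors, for illustration only.
share = inner_channel_share(baseline_err=20.0, frozen_err=18.4, bilevel_err=18.0)
graph_share = 1.0 - share  # remainder attributed to the graph channel
print(f"inner channel: {share:.0%}, graph channel: {graph_share:.0%}")
```

Under this definition a share above 100 percent, as at the upper end of the forecasting range, arises whenever the frozen control beats the full pipeline.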
What carries the argument
The frozen-phi control, which freezes the graph structure while retaining the inner-loop training schedule of T steps, isolates training-dynamics effects from the contribution of graph rewiring.
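A minimal sketch of the control, assuming a toy scalar problem in place of a GNN: `theta` stands in for model weights and `phi` for graph parameters. The losses, learning rates, and function name are hypothetical and are not the paper's algorithm; the point is only that both runs share the same T-step inner schedule and differ in whether the outer update of `phi` is applied.

```python
def train(T, outer_steps, freeze_phi, lr_theta=0.1, lr_phi=0.05):
    """Toy scalar stand-in for a bilevel loop (not the paper's algorithm)."""
    theta, phi = 0.0, 0.5
    inner_steps = 0
    for _ in range(outer_steps):
        # Inner loop: T gradient steps on the training loss (phi*theta - 1)^2.
        for _ in range(T):
            grad_theta = 2.0 * (phi * theta - 1.0) * phi
            theta -= lr_theta * grad_theta
            inner_steps += 1
        # Outer step: first-order update of phi on the validation loss
        # (phi*theta - 1.2)^2. The frozen-phi control skips only this step.
        if not freeze_phi:
            grad_phi = 2.0 * (phi * theta - 1.2) * theta
            phi -= lr_phi * grad_phi
    return theta, phi, inner_steps


bilevel = train(T=5, outer_steps=20, freeze_phi=False)
frozen = train(T=5, outer_steps=20, freeze_phi=True)
print("bilevel:", bilevel)
print("frozen: ", frozen)
```

Both runs perform the same number of inner steps, so any difference in final quality between a plain baseline and the frozen run would be attributable to the schedule rather than to changes in `phi`.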
If this is right
- On spatio-temporal forecasting the inner channel alone matches or exceeds the full bilevel gain.
- On node classification the inner channel accounts for 37-44 percent of the gain under Bernoulli parameterization.
- Classical spectral diagnostics of the learned graph can separate from measured task performance.
- A three-precondition test predicts the sign of bilevel gain across all six evaluated benchmarks.
- Graph distillation is offered as a method-agnostic complement to the frozen-phi diagnostic.
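To make the spectral-diagnostics point concrete, here is a sketch of one commonly used diagnostic: the spectral gap, taken here as the second-smallest eigenvalue of the combinatorial Laplacian. The source does not say which diagnostics the paper uses, so this choice is an assumption, as are the function name and the power-iteration approach.

```python
import math


def spectral_gap(adj, iters=500):
    """Second-smallest eigenvalue of the Laplacian L = D - A.

    adj is a symmetric adjacency matrix given as a list of lists. Uses
    power iteration on c*I - L after projecting out the constant vector,
    which is the eigenvector for eigenvalue 0.
    """
    n = len(adj)
    deg = [sum(row) for row in adj]
    # Every Laplacian eigenvalue is at most 2 * max degree; the +1 keeps
    # the shifted matrix strictly positive on non-constant vectors.
    c = 2.0 * max(deg) + 1.0

    def laplacian_times(v):
        return [deg[i] * v[i] - sum(adj[i][j] * v[j] for j in range(n))
                for i in range(n)]

    def project_and_normalize(v):
        mean = sum(v) / n
        v = [x - mean for x in v]
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    v = project_and_normalize([float((i + 1) ** 2) for i in range(n)])
    for _ in range(iters):
        lv = laplacian_times(v)
        v = project_and_normalize([c * v[i] - lv[i] for i in range(n)])
    lv = laplacian_times(v)
    return sum(v[i] * lv[i] for i in range(n))  # Rayleigh quotient


path = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
triangle = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
print(spectral_gap(path), spectral_gap(triangle))
```

A dissociation of the kind described above would show up as a rewired graph whose gap changes noticeably while the task metric does not, or the reverse.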
Where Pith is reading between the lines
- If inner-loop dynamics are the dominant factor, simpler repeated-training procedures without outer-loop structure optimization may suffice for many tasks.
- Routine use of frozen controls in bilevel papers would reduce misattribution of gains to structure changes.
- The decomposition technique could be applied to other bilevel setups outside graphs to check for similar inner-channel dominance.
- The precondition framework might serve as a quick filter to decide whether full bilevel training is worth the extra cost.
Load-bearing premise
Freezing the graph in the control experiment does not alter the optimization trajectory or the implicit regularization in ways that would be absent from the original bilevel procedure.
What would settle it
If the frozen-phi control reproduces less than half the reported bilevel gain on the same six benchmarks under identical hyper-parameters, the claim that inner-channel dynamics dominate would not hold.
Original abstract
Bilevel graph structure learning is widely understood to improve graph neural networks by jointly optimizing model parameters and a learned graph structure, with the resulting performance gain attributed to the rewired adjacency. We find that this attribution may be overstated: training-dynamics effects in the inner loop, rather than the rewiring itself, capture a substantial share of the gain. To establish this, we introduce frozen-$\phi$, a control that freezes the graph while retaining the inner-loop training schedule. This decomposes the bilevel gain into an inner channel of $T$-step training dynamics with implicit gradient regularization and a graph channel of the graph rewiring itself. On spatio-temporal flow forecasting the inner channel matches or exceeds the full bilevel pipeline, accounting for 78-101% of the gain; on node classification it accounts for 37-44% under a Bernoulli edge-level parameterization. We also verify that classical spectral diagnostics can dissociate from task gain. We propose frozen-$\phi$ as a standardized diagnostic for bilevel graph structure learning, with graph distillation as a method-agnostic complement. A three-precondition framework further predicts the sign of the bilevel gain on all six benchmarks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript re-examines bilevel graph structure learning (BGSL) for GNNs, arguing that reported performance gains are substantially attributable to training-dynamics effects in the inner loop (T-step optimization with implicit gradient regularization) rather than the graph rewiring itself. The authors introduce a 'frozen-φ' control that retains the inner-loop schedule but freezes the graph parameters, decomposing the bilevel gain into an 'inner channel' and a 'graph channel.' Experiments across spatio-temporal flow forecasting and node classification on six benchmarks show the inner channel accounting for 78-101% and 37-44% of the gains, respectively. They further demonstrate that classical spectral diagnostics can dissociate from task performance and propose a three-precondition framework to predict the sign of bilevel gains, recommending frozen-φ as a standardized diagnostic.
Significance. If the decomposition is valid, the work provides a useful corrective to common attributions in the BGSL literature and supplies a practical diagnostic tool plus a predictive framework that could guide future method design. The consistency of results across two task families and multiple benchmarks is a positive feature, as is the proposal of graph distillation as a complement. These elements could improve evaluation standards in graph structure learning if the control's isolation properties are confirmed.
major comments (2)
- [§3.2] §3.2 (Frozen-φ control definition): The central claim that the inner channel captures 78-101% (spatio-temporal) and 37-44% (node classification) of the bilevel gain rests on frozen-φ isolating training-dynamics effects without confounding changes. Freezing the graph removes the outer-loop dependence of model-parameter gradients on graph parameters, which may alter implicit regularization, curvature, or gradient flow relative to the coupled bilevel procedure. This risks mixing genuine inner-loop effects with control artifacts in the reported percentages; explicit checks (e.g., gradient-norm trajectories or effective regularization strength) comparing frozen-φ to the full bilevel run would be required to support the attribution.
- [Section 4] Section 4 (Experiments and Tables 2-3): The quantitative attributions are presented as consistent across benchmarks, yet the manuscript omits full hyperparameter schedules, exact data splits, random seeds, and implementation code. Without these, it is impossible to verify that the 78-101% and 37-44% figures are robust rather than sensitive to post-hoc choices, directly affecting confidence in the load-bearing empirical claims.
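The gradient-norm check requested in the first major comment could be summarized along these lines. This is a sketch under stated assumptions: per-step gradients are taken to be logged from both runs at matching inner steps, and the helper names and the logged values are hypothetical.

```python
import math


def grad_norm(grad):
    """Euclidean norm of a gradient given as a flat list of floats."""
    return math.sqrt(sum(g * g for g in grad))


def max_relative_gap(norms_a, norms_b, eps=1e-12):
    """Largest per-step relative difference between two norm trajectories."""
    if len(norms_a) != len(norms_b):
        raise ValueError("trajectories must cover the same inner steps")
    return max(abs(a - b) / max(abs(a), abs(b), eps)
               for a, b in zip(norms_a, norms_b))


# Hypothetical per-step gradients logged from the two runs.
bilevel_grads = [[3.0, 4.0], [1.5, 2.0], [0.6, 0.8]]
frozen_grads = [[3.0, 4.0], [1.8, 2.4], [0.6, 0.8]]

bilevel_norms = [grad_norm(g) for g in bilevel_grads]
frozen_norms = [grad_norm(g) for g in frozen_grads]
print("max relative gap:", max_relative_gap(bilevel_norms, frozen_norms))
```

A small gap across inner steps would support the claim that the control preserves the inner-loop dynamics; a large one would suggest that freezing changes more than the graph channel.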
minor comments (2)
- [§5] The three-precondition framework is introduced as predictive, but its derivation appears largely empirical; a brief discussion of how the preconditions were selected versus post-hoc fitting would clarify its generality.
- Notation for 'inner channel' versus 'graph channel' is introduced in the abstract and §3 but would benefit from an explicit equation or diagram in the introduction to aid readers unfamiliar with the decomposition.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The comments highlight important aspects of the frozen-φ control and reproducibility that we address below. We provide clarifications on the design rationale and commit to specific revisions that strengthen the empirical support without altering the core claims.
- Referee: [§3.2] §3.2 (Frozen-φ control definition): The central claim that the inner channel captures 78-101% (spatio-temporal) and 37-44% (node classification) of the bilevel gain rests on frozen-φ isolating training-dynamics effects without confounding changes. Freezing the graph removes the outer-loop dependence of model-parameter gradients on graph parameters, which may alter implicit regularization, curvature, or gradient flow relative to the coupled bilevel procedure. This risks mixing genuine inner-loop effects with control artifacts in the reported percentages; explicit checks (e.g., gradient-norm trajectories or effective regularization strength) comparing frozen-φ to the full bilevel run would be required to support the attribution.
Authors: We agree that the outer-loop coupling could in principle influence gradient flow and regularization strength, and that explicit verification is valuable. The frozen-φ control is constructed to retain the exact inner-loop optimization schedule (T steps with the same optimizer and loss) while disabling graph updates, thereby removing only the graph-channel contribution. To directly address the concern, the revised manuscript will add gradient-norm trajectory plots and effective regularization diagnostics (e.g., via Hessian trace approximations or loss curvature measures) comparing frozen-φ runs to the full bilevel procedure on the same benchmarks. These will demonstrate that the inner-loop dynamics remain comparable, supporting the reported attribution percentages. revision: yes
- Referee: [Section 4] Section 4 (Experiments and Tables 2-3): The quantitative attributions are presented as consistent across benchmarks, yet the manuscript omits full hyperparameter schedules, exact data splits, random seeds, and implementation code. Without these, it is impossible to verify that the 78-101% and 37-44% figures are robust rather than sensitive to post-hoc choices, directly affecting confidence in the load-bearing empirical claims.
Authors: We concur that full reproducibility details are essential for confidence in the quantitative results. The current version summarizes the main settings in Section 4 and the appendix, but we acknowledge the need for exhaustive documentation. In the revised manuscript we will expand the supplementary material to include complete hyperparameter schedules for all methods and baselines, exact train/validation/test splits with indices, all random seeds, and the full implementation code (including the frozen-φ variant) released under an open-source license upon acceptance. This will enable direct replication and robustness checks of the inner-channel percentages. revision: yes
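The reproducibility material promised in the second response could be captured in a per-run manifest. The sketch below assumes a simple random node split; the field names, split fractions, and hyperparameter keys are hypothetical and are not taken from the paper.

```python
import json
import random


def make_manifest(num_nodes, seed, train_frac=0.6, val_frac=0.2, hparams=None):
    """Record the seed, hyperparameters, and exact split indices of one run."""
    rng = random.Random(seed)  # local generator, so the split depends only on seed
    idx = list(range(num_nodes))
    rng.shuffle(idx)
    n_train = int(train_frac * num_nodes)
    n_val = int(val_frac * num_nodes)
    return {
        "seed": seed,
        "hparams": hparams or {},
        "splits": {
            "train": sorted(idx[:n_train]),
            "val": sorted(idx[n_train:n_train + n_val]),
            "test": sorted(idx[n_train + n_val:]),
        },
    }


manifest = make_manifest(num_nodes=10, seed=0, hparams={"T": 5, "freeze_phi": True})
print(json.dumps(manifest, indent=2, sort_keys=True))
```

Storing one such file next to each reported number would let a reader rerun the frozen-φ and full bilevel variants on identical splits and seeds.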
Circularity Check
No circularity: empirical controls and benchmark comparisons are self-contained
The paper's core contribution is an empirical decomposition of bilevel graph structure learning gains using the frozen-φ control to separate inner-loop training dynamics from graph rewiring effects, with reported attributions (78-101% on forecasting, 37-44% on classification) obtained via direct performance measurements against the full bilevel pipeline and baselines across six benchmarks. No mathematical derivation, first-principles result, or prediction reduces by construction to a fitted parameter, self-referential definition, or self-citation chain; the three-precondition framework is presented as an observational predictor of gain sign rather than a tautological renaming. Any prior self-citations on bilevel GSL methods are not load-bearing for the new attribution claims, which rest on independently verifiable experimental controls.
Axiom & Free-Parameter Ledger
free parameters (1)
- T (inner-loop steps)
axioms (1)
- domain assumption: Inner-loop training dynamics supply implicit gradient regularization whose benefit is largely independent of simultaneous graph changes
invented entities (1)
- frozen-phi control: no independent evidence
Reference graph
Works this paper leans on
- [1] Uri Alon and Eran Yahav. On the bottleneck of graph neural networks and its practical implications. In ICLR, 2021
- [2] Lei Bai, Lina Yao, Can Li, Xianzhi Wang, and Can Wang. Adaptive graph convolutional recurrent network for traffic forecasting. In NeurIPS, 2020
- [3] Pradeep Kumar Banerjee, Kedar Karhadkar, Yu Guang Wang, Uri Alon, and Guido Montúfar. Oversquashing in GNNs through the lens of information contraction and graph expansion. In Allerton, 2022
- [4] David G.T. Barrett and Benoit Dherin. Implicit gradient regularization. In ICLR, 2021
- [5] Mitchell Black, Zhengchao Wan, Amir Nayyeri, and Yusu Wang. Understanding oversquashing in GNNs through the lens of effective resistance. In ICML, 2023
- [6] Yu Chen, Lingfei Wu, and Mohammed J. Zaki. Iterative deep graph learning for graph neural networks: Better and robust node embeddings. In NeurIPS, 2020
- [7] Andrea Cini and Ivan Marisca. Torch spatiotemporal, 2022. Software library
- [8] Andrea Cini, Ivan Marisca, and Cesare Alippi. Filling the g_ap_s: Multivariate time series imputation by graph neural networks. In ICLR, 2022
- [9] Andrea Cini, Ivan Marisca, Daniele Zambon, and Cesare Alippi. Taming local effects in graph-based spatiotemporal forecasting. In NeurIPS, 2023
- [10] Andrea Cini, Daniele Zambon, and Cesare Alippi. Sparse graph learning from spatiotemporal time series. Journal of Machine Learning Research, 2023
- [11]
- [12] Bahare Fatemi, Layla El Asri, and Seyed Mehran Kazemi. SLAPS: Self-supervision improves structure learning for graph neural networks. In NeurIPS, 2021
- [13] Luca Franceschi, Mathias Niepert, Massimiliano Pontil, and Xiao He. Learning discrete structures for graph neural networks. In ICML, 2019
- [14] Johannes Gasteiger, Stefan Weißenberger, and Stephan Günnemann. Diffusion improves graph learning. In NeurIPS, 2019
- [15] Minyang Hu, Hong Chang, Bingpeng Ma, and Shiguang Shan. Learning continuous graph structure with bilevel programming for graph neural networks. In IJCAI, 2022
- [16] Renhe Jiang, Zhaonan Wang, Jiawei Yong, Puneet Jeph, Quanjun Chen, Yasumasa Kobayashi, Xuan Song, Shintaro Fukushima, and Toyotaro Suzumura. Spatio-temporal meta-graph learning for traffic forecasting. In AAAI, 2023
- [17] Wei Jin, Yao Ma, Xiaorui Liu, Xianfeng Tang, Suhang Wang, and Jiliang Tang. Graph structure learning for robust graph neural networks. In KDD, 2020
- [18] Kedar Karhadkar, Pradeep Kumar Banerjee, and Guido Montúfar. FoSR: First-order spectral rewiring for addressing oversquashing in GNNs. In ICLR, 2023
- [19] Liam Li and Ameet Talwalkar. Random search and reproducibility for neural architecture search. In UAI, 2019
- [20] Yaguang Li, Rose Yu, Cyrus Shahabi, and Yan Liu. Diffusion convolutional recurrent neural network: Data-driven traffic forecasting. In ICLR, 2018
- [21] Zhixun Li, Liang Wang, Xin Sun, Yifan Luo, Yanqiao Zhu, Dingshuo Chen, Yingtao Luo, Xiangxin Zhou, Qiang Liu, Shu Wu, Jeffrey Xu Yu, and Liang Wang. GSLB: The graph structure learning benchmark. In NeurIPS Datasets and Benchmarks Track, 2023
- [22] M. J. Lighthill and G. B. Whitham. On kinematic waves. II. A theory of traffic flow on long crowded roads. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, 229(1178):317–345, 1955
- [23] Hangchen Liu, Zheng Dong, Renhe Jiang, Jiewen Deng, Jinliang Deng, Quanjun Chen, and Xuan Song. Spatio-temporal adaptive embedding makes vanilla transformer SOTA for traffic forecasting. In CIKM, 2023
- [24] Hanxiao Liu, Karen Simonyan, and Yiming Yang. DARTS: Differentiable architecture search. In ICLR, 2019
- [25] Yixin Liu, Yu Zheng, Daokun Zhang, Hongxu Chen, Hao Peng, and Shirui Pan. Towards unsupervised deep graph structure learning. In WWW, 2022
- [26] Alessandro Manenti, Daniele Zambon, and Cesare Alippi. Learning latent graph structures and their uncertainty. In ICML, 2025
- [27]
- [28] Khang Nguyen, Nong Minh Hieu, Vinh Duc Nguyen, Nhat Ho, Stanley Osher, and Tan Minh Nguyen. Revisiting over-smoothing and over-squashing using Ollivier-Ricci curvature. In ICML, 2023
- [29] Hongbin Pei, Bingzhe Wei, Kevin Chen-Chuan Chang, Yu Lei, and Bo Yang. Geom-GCN: Geometric graph convolutional networks. In ICLR, 2020
- [30] Oleg Platonov, Denis Kuznedelev, Michael Diskin, Artem Babenko, and Liudmila Prokhorenkova. A critical look at the evaluation of GNNs under heterophily: Are we really making progress? In ICLR, 2023
- [31] Aravind Rajeswaran, Chelsea Finn, Sham M. Kakade, and Sergey Levine. Meta-learning with implicit gradients. In NeurIPS, 2019
- [32] Chao Shang, Jie Chen, and Jinbo Bi. Discrete graph structure learning for forecasting multiple time series. In ICLR, 2021
- [33] Samuel L. Smith, Benoit Dherin, David G.T. Barrett, and Soham De. On the origin of implicit regularization in stochastic gradient descent. In ICLR, 2021
- [34] Jiabin Tang, Tang Qian, Shijing Liu, Shengdong Du, Jie Hu, and Tianrui Li. Spatio-temporal latent graph structure learning for traffic forecasting. In IJCNN, 2022
- [35]
- [36] Floriano Tori, Vincent Holst, and Vincent Ginis. The effectiveness of curvature-based rewiring and the role of hyperparameters in GNNs revisited. In ICLR, 2025
- [37] Paul Vicol, Jonathan P. Lorraine, Fabian Pedregosa, David Duvenaud, and Roger Grosse. On implicit bias in overparameterized bilevel optimization. In ICML, 2022
- [38] Ruijia Wang, Shuai Mou, Xiao Wang, Wanpeng Xiao, Qi Ju, Chuan Shi, and Xing Xie. Graph structure estimation neural networks. In WWW, 2021
- [39] Zonghan Wu, Shirui Pan, Guodong Long, Jing Jiang, and Chengqi Zhang. Graph WaveNet for deep spatial-temporal graph modeling. In IJCAI, 2019
- [40] Antoine Yang, Pedro M. Esperança, and Fabio M. Carlucci. NAS evaluation is frustratingly hard. In ICLR, 2020
- [41] Zhilin Yang, William W. Cohen, and Ruslan Salakhutdinov. Revisiting semi-supervised learning with graph embeddings. In ICML, 2016
- [42] Qi Zhang, Jianlong Chang, Gaofeng Meng, Shiming Xiang, and Chunhong Pan. Spatio-temporal graph structure learning for traffic forecasting. In AAAI, 2020
- [43] Yu Zheng, Xiuwen Yi, Ming Li, Ruiyuan Li, Zhangqing Shan, Eric Chang, and Tianrui Li. Forecasting fine-grained air quality based on big data. In KDD, 2015
- [44] Zhiyao Zhou, Sheng Zhou, Bochao Mao, Xuanyi Zhou, Jiawei Chen, Qiaoyu Tan, Daochen Zha, Yan Feng, Chun Chen, and Can Wang. OpenGSL: A comprehensive benchmark for graph structure learning. In NeurIPS Datasets and Benchmarks Track, 2023
A Algorithm
Algorithm 1: First-Order Bilevel Rewiring for STGNNs
Require: Spatial graph $A$, STGNN $f_\theta$, policy $\pi_\phi$, warmup ep...
- [45] and Smith et al. [33]). (A4) Per-step stochastic gradient statistics $\|g_t\|^2$ and $\operatorname{tr}(\Sigma_t)/B$ are bounded. Proof. Barrett and Dherin [4] establish via backward error analysis that a single gradient descent step with step size $\eta$ on $L_{\text{train}}$ maps to the exact flow of a modified loss $\tilde{L}_{\text{train}} = L_{\text{train}} + (\eta/4)\|\nabla L_{\text{train}}\|^2 + O(\eta^2)$. Smith et al. [33] extend this to SGD ...
- [46] that the modified loss for stochastic gradient descent with per-batch gradients $\hat{g}$ and batch size $B$ is $\tilde{L}^{\text{SGD}}_{\text{train}} = L_{\text{train}} + (\eta/4)\left(\|g\|^2 + \operatorname{tr}(\Sigma)/B\right) + O(\eta^2)$, where $g$ is the per-step mean of $\hat{g}$ and $\Sigma$ denotes the per-example gradient covariance (so $\hat{g}$ has covariance $\Sigma/B$). With $\phi$ held fixed within the inner loop, the per-step RIGR is a functional of the inner-loo...
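The penalty term in the modified SGD loss quoted above, (η/4)(‖g‖² + tr(Σ)/B), can be computed from per-example gradients as in the sketch below. It assumes Σ uses population normalization (divide by the number of examples), which the excerpt does not specify; the function name and the example gradients are hypothetical.

```python
def sgd_igr_penalty(per_example_grads, lr, batch_size):
    """Evaluate (lr/4) * (||g||^2 + tr(Sigma)/B) from per-example gradients.

    g is the mean gradient and Sigma the per-example gradient covariance,
    here with population normalization (an assumption of this sketch).
    """
    n = len(per_example_grads)
    dim = len(per_example_grads[0])
    g = [sum(gr[k] for gr in per_example_grads) / n for k in range(dim)]
    g_sq_norm = sum(x * x for x in g)
    # tr(Sigma) equals the mean squared distance of each gradient from g.
    trace_sigma = sum(
        sum((gr[k] - g[k]) ** 2 for k in range(dim))
        for gr in per_example_grads
    ) / n
    return (lr / 4.0) * (g_sq_norm + trace_sigma / batch_size)


# Hypothetical per-example gradients, for illustration only.
grads = [[1.0, 0.0], [3.0, 0.0]]
print(sgd_igr_penalty(grads, lr=0.1, batch_size=2))
```

As the batch size grows the trace term vanishes and the expression reduces to the full-batch penalty (η/4)‖g‖² quoted from Barrett and Dherin [4].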