Implicit Regularization of Mini-Batch Training in Graph Neural Networks

Antoine Vialle; Clement Wang; Robin Vaysse; Thomas Bonald

arxiv: 2605.22480 · v1 · pith:PYYVJU2Unew · submitted 2026-05-21 · 💻 cs.LG · cs.AI

Implicit Regularization of Mini-Batch Training in Graph Neural Networks

Clement Wang , Antoine Vialle , Robin Vaysse , Thomas Bonald This is my paper

Pith reviewed 2026-05-22 06:45 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords graph neural networksmini-batch trainingimplicit regularizationrandom node samplingbackward error analysisscalable GNN training

0 comments

The pith

Random node sampling for mini-batch GNN training implicitly regularizes by producing lower-variance gradients and losses closer to the full graph.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that the simplest random node sampling scheme for creating mini-batches in graph neural network training matches or beats full-graph training on eight of ten datasets while using far less memory and time. It explains the result by applying backward error analysis to the mini-batch stochastic gradient descent process on graphs. This analysis establishes that the training procedure implicitly optimizes the loss on the sampled subgraph plus an extra regularizer term whose size is proportional to the variance of the gradients computed on each mini-batch. Random node sampling creates batches whose expected loss sits nearer the full-graph loss and whose gradients exhibit lower variance, which produces a more favorable implicit objective even though the method ignores local graph connectivity.

Core claim

Mini-batch training with Random Node Sampling matches or outperforms full-graph training on 8 of 10 datasets because RNS produces mini-batches whose expected loss is closer to the full-graph loss and whose per-batch gradients have lower variance, yielding a better implicit objective.

What carries the argument

Backward error analysis of graph mini-batch SGD, which identifies an implicit regularizer proportional to the variance of per-batch gradients and shaped by the choice of sampler.

If this is right

Sampler selection becomes equivalent to choosing the strength and form of an implicit regularizer.
Random node sampling offers a theoretically grounded baseline for scalable GNN training that avoids the cost of structure-preserving methods.
Gradient variance can serve as a practical diagnostic for predicting which sampler will yield the best final model.
The same variance-driven regularization mechanism may govern performance gaps between samplers on other graph tasks or architectures.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Designers could optimize new samplers explicitly to minimize measured gradient variance rather than to preserve local topology.
The same analysis may extend to mini-batch training on other non-i.i.d. structures such as temporal sequences or spatial grids.
Direct estimation of the implicit regularizer term on held-out data would allow quantitative comparison of different sampling schemes.
The result raises the question of whether full-graph training itself can be improved by adding an explicit variance penalty.

Load-bearing premise

The backward error analysis originally developed for independent mini-batch samples continues to hold when the samples are dependent subgraphs drawn from a graph.

What would settle it

An experiment that measures mini-batch gradient variance across samplers and finds that higher variance under random node sampling still produces better test performance than full-graph training would undermine the claimed link between variance and the implicit objective.

Figures

Figures reproduced from arXiv: 2605.22480 by Antoine Vialle, Clement Wang, Robin Vaysse, Thomas Bonald.

**Figure 1.** Figure 1: Batch sampling strategies for GNN training. Top: mini-batch construction for each method. Bottom: resulting sampled subgraph. Red nodes are targets (Btrain), while blue nodes are auxiliary nodes used only for message passing (B \ Btrain). See Appendix E.1.1 for details. 3 Setting, notation, and sampling strategies We study transductive node-level learning on a graph G = (V, E), where each node v ∈ V has fe… view at source ↗

**Figure 2.** Figure 2: Implicit regularization terms induced by different mini-batch samplers at random initializa [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗

**Figure 3.** Figure 3: Impact of the number of batches m. Test accuracy vs. m (batches per epoch) on OGBNARXIV, OGBN-PRODUCTS, and POKEC, shaded areas indicate 95% CI over 5 seeds. We fix the architecture and tune training hyperparameters separately for each value of m 6 Discussion and limitations 6.1 From SGD to Adam: a theory-practice gap Our backward error analysis is derived for vanilla SGD in order to keep the modified-equ… view at source ↗

**Figure 4.** Figure 4: Properties of sampled graphs as a function of the number of batches m. Evolution of subsampled-graph metrics with respect to m (number of batches per epoch). Shaded regions indicate ± one standard deviation over 5 epochs (full pass over all batches). The diameter lower bound corresponds to an estimate of the maximum distance between two nodes within the same connected component, computed using multi-sweep … view at source ↗

**Figure 5.** Figure 5: Power-law exponent error |αSampl − αFull| vs. batch gradient variance R(w) at random initialization, across samplers and datasets. 31 [PITH_FULL_IMAGE:figures/full_fig_p031_5.png] view at source ↗

**Figure 6.** Figure 6: Validation accuracy over 200 epochs for full-graph training and RNS. [PITH_FULL_IMAGE:figures/full_fig_p032_6.png] view at source ↗

read the original abstract

Mini-batch training of Graph Neural Networks (GNNs) is fundamentally different from training on i.i.d. data: sampling a subgraph alters the topology and introduces boundary effects, leading prior work to develop structure-aware samplers that preserve local connectivity and reduce embedding variance. Surprisingly, we demonstrate that the simplest possible scheme, Random Node Sampling (RNS), training on the induced subgraph of uniformly sampled nodes, matches or outperforms full-graph training on 8 of 10 datasets at a fraction of the wall-clock time and memory. To explain this, we apply backward error analysis to graph mini-batch Stochastic Gradient Descent (SGD) and show that it implicitly minimizes the sampled loss plus a regularizer proportional to the mini-batch gradient variance, a quantity directly shaped by the sampler. Although RNS discards local structure, it produces mini-batches whose expected loss is closer to the full-graph loss, and whose per-batch gradients have lower variance, yielding a better implicit objective. Our analysis reframes the choice of graph sampler as a form of implicit regularization, and identifies RNS as a strong, theoretically grounded method for scalable GNN training.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

RNS matches full-graph performance on most datasets here and the backward-error reframing of sampler choice is the real novelty, though the non-i.i.d. extension of the analysis looks shaky.

read the letter

The key point is that random node sampling often matches or beats full-graph training on eight of the ten datasets while using far less time and memory. The paper reframes sampler selection as implicit regularization by applying backward error analysis to show that lower per-batch gradient variance produces a better effective objective than structure-preserving alternatives. That empirical pattern and the variance-based explanation are the main new pieces. The work does a clean job of contrasting RNS against the usual emphasis on preserving local connectivity and of tying the sampler directly to the implicit regularizer term. The derivation itself is presented as a direct extension of existing BAE results for mini-batch SGD. The soft spot is exactly the one the stress-test flags: graph samples are not independent, so subgraphs share neighbors and edges. Standard BAE derivations assume i.i.d. draws and do not automatically cancel the resulting cross-batch covariance terms in the gradient noise. If those extra terms are not bounded or shown to be negligible, the claimed exact proportionality between sampler and implicit objective is only approximate. The abstract does not spell out how the non-i.i.d. structure is handled, which leaves the theoretical account thinner than it first appears. Minor practical questions also remain about whether batch-size normalization or embedding variance controls were matched across samplers. This paper is aimed at researchers who train GNNs at scale and are looking for simpler baselines. The empirical result is sharp enough and the reframing novel enough that it deserves a serious referee, even if the theory section will probably need tightening on the covariance issue.

Referee Report

1 major / 2 minor

Summary. The paper claims that Random Node Sampling (RNS) for mini-batch GNN training matches or outperforms full-graph training on 8 of 10 datasets while using far less time and memory. It applies backward error analysis to graph mini-batch SGD to derive an implicit objective consisting of the sampled loss plus a regularizer proportional to mini-batch gradient variance, a quantity shaped by the sampler. The analysis concludes that RNS produces batches whose expected loss is closer to the full-graph loss and whose gradients have lower variance, yielding a superior implicit objective and reframing sampler choice as implicit regularization.

Significance. If the central derivation and empirical results hold, the work provides a theoretically grounded explanation for why a simple sampler can outperform more complex structure-preserving alternatives, identifies a concrete implicit regularizer, and supplies reproducible empirical evidence across 10 datasets. These elements strengthen the case for viewing graph sampling through the lens of implicit regularization rather than solely through bias-variance trade-offs in embedding quality.

major comments (1)

[Backward error analysis] Backward error analysis section: The derivation states that the implicit regularizer is exactly proportional to mini-batch gradient variance shaped by the sampler. In the graph setting, independently drawn node samples produce dependent subgraphs that share neighbors and edges; this introduces non-zero cross-batch covariances in the gradient noise. The manuscript does not appear to include or bound these covariance terms, so the claimed exact proportionality between sampler choice and the implicit objective may not hold without additional analysis.

minor comments (2)

[Experiments] The empirical section would benefit from an explicit table listing per-dataset performance deltas (accuracy or loss) between RNS, full-graph training, and at least one structure-preserving baseline, together with standard deviations over multiple runs.
[Theoretical analysis] Notation for the variance term in the implicit objective should be defined once and used consistently; currently the same symbol appears to be overloaded between per-batch gradient variance and the regularizer coefficient.

Simulated Author's Rebuttal

1 responses · 0 unresolved

Thank you for the opportunity to respond to the referee's report. We are grateful for the positive assessment of our work's significance and for the constructive feedback on the backward error analysis. We address the major comment below.

read point-by-point responses

Referee: [Backward error analysis] Backward error analysis section: The derivation states that the implicit regularizer is exactly proportional to mini-batch gradient variance shaped by the sampler. In the graph setting, independently drawn node samples produce dependent subgraphs that share neighbors and edges; this introduces non-zero cross-batch covariances in the gradient noise. The manuscript does not appear to include or bound these covariance terms, so the claimed exact proportionality between sampler choice and the implicit objective may not hold without additional analysis.

Authors: We appreciate the referee's detailed comment on the backward error analysis. The analysis in the manuscript applies backward error analysis to the mini-batch SGD updates in the graph setting. The implicit regularizer is derived as being proportional to the variance of the mini-batch gradient estimator, which depends on the sampling method. While we acknowledge that sampled subgraphs can share neighbors leading to dependencies, these dependencies primarily influence the covariance between gradient estimates from different batches. However, the backward error analysis focuses on the local effect of each stochastic update, for which the relevant quantity is indeed the variance of the individual gradient. The cross terms do not appear in the leading-order implicit objective. We will revise the manuscript to include a brief explanation of why the cross-batch covariances do not affect the claimed proportionality, thereby addressing this concern. revision: partial

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

full rationale

The paper derives its central claim by applying backward error analysis (a standard technique from prior optimization literature) to graph mini-batch SGD, obtaining an implicit regularizer term proportional to per-batch gradient variance. This proportionality follows directly from the BAE expansion and is not obtained by fitting to the target performance or by redefining the sampler in terms of the regularizer. The observation that RNS produces lower variance (and thus a better implicit objective) is a consequence of the sampler definition interacting with the independently derived regularizer, not a self-referential reduction. No load-bearing step reduces to a self-citation chain, an ansatz smuggled via citation, or a fitted input renamed as prediction. The derivation is therefore self-contained against external benchmarks of BAE and the reported empirical results on the 10 datasets.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the applicability of backward error analysis to graph-structured mini-batch SGD and on the empirical comparison across 10 datasets; no new entities are postulated and no parameters are fitted to produce the implicit regularizer.

axioms (1)

domain assumption Backward error analysis for SGD extends to the non-i.i.d. mini-batch setting induced by graph sampling
Invoked to derive the implicit regularizer proportional to gradient variance.

pith-pipeline@v0.9.0 · 5732 in / 1412 out tokens · 40291 ms · 2026-05-22T06:45:49.999457+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We apply backward error analysis to graph mini-batch Stochastic Gradient Descent (SGD) and show that it implicitly minimizes the sampled loss plus a regularizer proportional to the mini-batch gradient variance, a quantity directly shaped by the sampler.
IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean J_uniquely_calibrated_via_higher_derivative unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

˜LSampl(w) = ¯LSampl(w) + ϵ/4 ∥∇¯LSampl(w)∥² + ϵ/4m Σ ∥∇ˆLBkSampl(w) − ∇¯LSampl(w)∥²

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages · 2 internal anchors

[1]

GraphLand: Evaluating Graph Machine Learning Models on Diverse Industrial Data

Gleb Bazhenov, Oleg Platonov, and Liudmila Prokhorenkova. Graphland: Evaluating graph machine learning models on diverse industrial data.arXiv preprint arXiv:2409.14500,

work page internal anchor Pith review Pith/arXiv arXiv
[2]

FastGCN: Fast Learning with Graph Convolutional Networks via Importance Sampling

doi: 10.48550/arXiv.1801.10247. Wei-Lin Chiang, Xuanqing Liu, Si Si, Yang Li, Samy Bengio, and Cho-Jui Hsieh. Cluster-gcn: An efficient algorithm for training deep and large graph convolutional networks. InProceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining, pages 257–266,

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1801.10247
[3]

12 A Implicit objective derivation via backward analysis Our proofs closely follow the framework and several key arguments of Smith et al. [2021]. We refer readers to that work for additional background and intuition. A.1 Full graph training Setup and assumptions.We consider gradient descent with step sizeα >0, wt+1 =w t +αf(w t),(9) where f:R d →R d is t...

work page 2021
[4]

small finite

=w(t) +ϵf(w) +ϵ 2 g(w) + 1 2 ∇f(w)f(w) +O(ϵ 3), (28) using (11). We choose α=ϵ/m so that the cumulative step size after m discrete updates is the fixed increment ϵ, allowing us to compare the m-step discrete map directly to the continuous-time expansion over timeϵ. Matching the Euler discretization wk+1 =w k +ϵf(w k),(29) yields g(w) =− 1 2 ∇f(w)f(w).(30)...

work page 2021
[5]

Consequently Pr(v∈B| |B train|=t)̸=t/N t in general, and the cancellation in Step 1 breaks

For ClusterGCN and GraphSAINT, Lemma 2 fails: clusters are not exchangeable, and random-walk samplers have stationary distribution proportional to degree, so conditional on |Btrain|=t the supervised intersection Btrain isnotuniform over t-subsets of V train. Consequently Pr(v∈B| |B train|=t)̸=t/N t in general, and the cancellation in Step 1 breaks. The bi...

work page 1999
[6]

Combined with ν(d) =Cd −α(1 +o(1)) and the fact that d→ ∞ uniformly on Wk, this yields ν(d) =C k q −α (1 +o(1)),uniformly ford∈W k

Sincek→ ∞, eventuallyk≥d 0, and then for everyd≥k, ν(d)≤C ′d−α ≤C ′k−α.(∗) Ford∈W k, d k/q = 1 +O qk−1/2+ε = 1 +o(1), uniformly ind∈W k, so d−α = k q −α (1 +o(1)) uniformly. Combined with ν(d) =Cd −α(1 +o(1)) and the fact that d→ ∞ uniformly on Wk, this yields ν(d) =C k q −α (1 +o(1)),uniformly ford∈W k. 21 Step 5: Decomposition.Split ep(k) = X d∈Wk ν(d)H...

work page 2020
[7]

OGBN-ARXIV.OGBN-ARXIVis a directed citation network of Computer Science arXiv papers indexed by MAG Hu et al. [2020]. Nodes correspond to papers, and directed edges represent citation relationships. Each node is associated with a 128-dimensional feature vector obtained by averaging word embeddings from the paper title and abstract. We use the official OGB...

work page 2020
[8]

Each dataset is provided with two official random split regimes, Random Low (RL) and Random High (RH), which differ in the amount of labeled data available for training. In the RL regime, nodes are randomly partitioned into train/validation/test sets with proportions 10%/10%/80%, whereas in the RH regime the corresponding proportions are 50%/25%/25%. We r...

work page 2019
[9]

24 Algorithm 1RNS data loading for one epoch Require:GraphG= (V, E), device, number of partitionsP 1:G←G.to(device) 2:n← |V| 3:b← ⌊n/P⌋ 4:π←torch.randperm(n,device=device) 5:fori= 0,1, . . . , P−1do 6:s←i·b 7:t←s+b 8:S i ←π[s:t] 9:G i ←extract_subgraph_from_nodes(G, S i) 10:G i ←to_sparse_tensor(G i) 11:yieldG i 12:end for E.1.2 Fair hyperparameter tuning...

work page 2024
[10]

Second, foreach sampling method and each choice of sampling hyperparameters, we independently tune thetraining-related hyperparameters

By keeping the architecture fixed, we reduce the risk that differences in accuracy are driven by model design rather than by sampling. Second, foreach sampling method and each choice of sampling hyperparameters, we independently tune thetraining-related hyperparameters. Concretely, for every dataset and every candidate sampling configuration, we run a sep...

work page 2000
[11]

For neighbor sampling, we train for 20 epochs only. This choice accounts for its substantially larger number of mini-batches per epoch, resulting from the smaller batch size, and its considerably higher training cost relative to the other methods (see Appendix F.5). Sampling hyperparameter search spaces.We now describe the search spaces used for the metho...

work page 2019
[12]

For memory-intensive settings, we use activation checkpointing: specifically, for full-graph training onPOKEC-REGIONS and for RNS training onWEB-FRAUD

These experiments are run on NVIDIA L40S GPUs with 48GB of memory. For memory-intensive settings, we use activation checkpointing: specifically, for full-graph training onPOKEC-REGIONS and for RNS training onWEB-FRAUD. For full-graph training onWEB-FRAUDandWEB-TOPICS, activation checkpointing alone is not sufficient; therefore, we use layer-parallel train...

work page 2024
[13]

does not report the SGFormer hyperparameters, so we instead adopt those from the original SGFormer paper Wu et al. [2023]. BecauseOGBN-PRODUCTSis not included in Wu et al. [2023], we use for this dataset the hyperparameter configuration reported for Amazon2M, which is the closest large-scale benchmark considered in that work. This choice may partly explai...

work page 2023

[1] [1]

GraphLand: Evaluating Graph Machine Learning Models on Diverse Industrial Data

Gleb Bazhenov, Oleg Platonov, and Liudmila Prokhorenkova. Graphland: Evaluating graph machine learning models on diverse industrial data.arXiv preprint arXiv:2409.14500,

work page internal anchor Pith review Pith/arXiv arXiv

[2] [2]

FastGCN: Fast Learning with Graph Convolutional Networks via Importance Sampling

doi: 10.48550/arXiv.1801.10247. Wei-Lin Chiang, Xuanqing Liu, Si Si, Yang Li, Samy Bengio, and Cho-Jui Hsieh. Cluster-gcn: An efficient algorithm for training deep and large graph convolutional networks. InProceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining, pages 257–266,

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1801.10247

[3] [3]

12 A Implicit objective derivation via backward analysis Our proofs closely follow the framework and several key arguments of Smith et al. [2021]. We refer readers to that work for additional background and intuition. A.1 Full graph training Setup and assumptions.We consider gradient descent with step sizeα >0, wt+1 =w t +αf(w t),(9) where f:R d →R d is t...

work page 2021

[4] [4]

small finite

=w(t) +ϵf(w) +ϵ 2 g(w) + 1 2 ∇f(w)f(w) +O(ϵ 3), (28) using (11). We choose α=ϵ/m so that the cumulative step size after m discrete updates is the fixed increment ϵ, allowing us to compare the m-step discrete map directly to the continuous-time expansion over timeϵ. Matching the Euler discretization wk+1 =w k +ϵf(w k),(29) yields g(w) =− 1 2 ∇f(w)f(w).(30)...

work page 2021

[5] [5]

Consequently Pr(v∈B| |B train|=t)̸=t/N t in general, and the cancellation in Step 1 breaks

For ClusterGCN and GraphSAINT, Lemma 2 fails: clusters are not exchangeable, and random-walk samplers have stationary distribution proportional to degree, so conditional on |Btrain|=t the supervised intersection Btrain isnotuniform over t-subsets of V train. Consequently Pr(v∈B| |B train|=t)̸=t/N t in general, and the cancellation in Step 1 breaks. The bi...

work page 1999

[6] [6]

Combined with ν(d) =Cd −α(1 +o(1)) and the fact that d→ ∞ uniformly on Wk, this yields ν(d) =C k q −α (1 +o(1)),uniformly ford∈W k

Sincek→ ∞, eventuallyk≥d 0, and then for everyd≥k, ν(d)≤C ′d−α ≤C ′k−α.(∗) Ford∈W k, d k/q = 1 +O qk−1/2+ε = 1 +o(1), uniformly ind∈W k, so d−α = k q −α (1 +o(1)) uniformly. Combined with ν(d) =Cd −α(1 +o(1)) and the fact that d→ ∞ uniformly on Wk, this yields ν(d) =C k q −α (1 +o(1)),uniformly ford∈W k. 21 Step 5: Decomposition.Split ep(k) = X d∈Wk ν(d)H...

work page 2020

[7] [7]

OGBN-ARXIV.OGBN-ARXIVis a directed citation network of Computer Science arXiv papers indexed by MAG Hu et al. [2020]. Nodes correspond to papers, and directed edges represent citation relationships. Each node is associated with a 128-dimensional feature vector obtained by averaging word embeddings from the paper title and abstract. We use the official OGB...

work page 2020

[8] [8]

Each dataset is provided with two official random split regimes, Random Low (RL) and Random High (RH), which differ in the amount of labeled data available for training. In the RL regime, nodes are randomly partitioned into train/validation/test sets with proportions 10%/10%/80%, whereas in the RH regime the corresponding proportions are 50%/25%/25%. We r...

work page 2019

[9] [9]

24 Algorithm 1RNS data loading for one epoch Require:GraphG= (V, E), device, number of partitionsP 1:G←G.to(device) 2:n← |V| 3:b← ⌊n/P⌋ 4:π←torch.randperm(n,device=device) 5:fori= 0,1, . . . , P−1do 6:s←i·b 7:t←s+b 8:S i ←π[s:t] 9:G i ←extract_subgraph_from_nodes(G, S i) 10:G i ←to_sparse_tensor(G i) 11:yieldG i 12:end for E.1.2 Fair hyperparameter tuning...

work page 2024

[10] [10]

Second, foreach sampling method and each choice of sampling hyperparameters, we independently tune thetraining-related hyperparameters

By keeping the architecture fixed, we reduce the risk that differences in accuracy are driven by model design rather than by sampling. Second, foreach sampling method and each choice of sampling hyperparameters, we independently tune thetraining-related hyperparameters. Concretely, for every dataset and every candidate sampling configuration, we run a sep...

work page 2000

[11] [11]

For neighbor sampling, we train for 20 epochs only. This choice accounts for its substantially larger number of mini-batches per epoch, resulting from the smaller batch size, and its considerably higher training cost relative to the other methods (see Appendix F.5). Sampling hyperparameter search spaces.We now describe the search spaces used for the metho...

work page 2019

[12] [12]

For memory-intensive settings, we use activation checkpointing: specifically, for full-graph training onPOKEC-REGIONS and for RNS training onWEB-FRAUD

These experiments are run on NVIDIA L40S GPUs with 48GB of memory. For memory-intensive settings, we use activation checkpointing: specifically, for full-graph training onPOKEC-REGIONS and for RNS training onWEB-FRAUD. For full-graph training onWEB-FRAUDandWEB-TOPICS, activation checkpointing alone is not sufficient; therefore, we use layer-parallel train...

work page 2024

[13] [13]

does not report the SGFormer hyperparameters, so we instead adopt those from the original SGFormer paper Wu et al. [2023]. BecauseOGBN-PRODUCTSis not included in Wu et al. [2023], we use for this dataset the hyperparameter configuration reported for Amazon2M, which is the closest large-scale benchmark considered in that work. This choice may partly explai...

work page 2023