Full-Graph vs. Mini-Batch Training: Comprehensive Analysis from a Batch Size and Fan-Out Size Perspective
Pith reviewed 2026-05-21 13:52 UTC · model grok-4.3
The pith
Full-graph GNN training does not always outperform well-tuned mini-batch settings in accuracy or efficiency.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes that full-graph training, viewed as the extreme case of maximum batch size and fan-out size, does not consistently provide superior model performance or computational efficiency compared to carefully tuned smaller mini-batch configurations in GNNs.
What carries the argument
Batch size and fan-out size as parameters that scale training in GNNs, analyzed through a Wasserstein distance generalization bound that quantifies distribution shift from sampling.
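To make the two knobs concrete, here is a minimal sketch of node-wise neighbor sampling on a toy adjacency list. It is not the paper's implementation: the function names, the toy graph, and the convention that `fan_out=None` means "keep every neighbor" are assumptions chosen for illustration. It shows the batch size as the number of seed nodes, the fan-out as the per-layer cap on sampled neighbors, and full-graph training as the limit where every node is a seed and no neighbor is dropped.

```python
import random

def sample_neighbors(adj, node, fan_out, rng):
    """Return at most `fan_out` neighbors of `node`, sampled without replacement.

    fan_out=None (an assumed convention) keeps every neighbor.
    """
    neighbors = adj[node]
    if fan_out is None or fan_out >= len(neighbors):
        return list(neighbors)
    return rng.sample(neighbors, fan_out)

def sample_computation_graph(adj, seeds, fan_outs, rng):
    """Expand the seed nodes layer by layer; `fan_outs` has one entry per GNN layer.

    Returns the set of nodes touched when computing outputs for `seeds`.
    """
    frontier = set(seeds)
    visited = set(seeds)
    for fan_out in fan_outs:
        next_frontier = set()
        for node in frontier:
            next_frontier.update(sample_neighbors(adj, node, fan_out, rng))
        visited |= next_frontier
        frontier = next_frontier
    return visited

# Hypothetical toy graph: node 0 is a hub, nodes 1..5 form a path.
adj = {
    0: [1, 2, 3, 4, 5],
    1: [0, 2],
    2: [0, 1, 3],
    3: [0, 2, 4],
    4: [0, 3, 5],
    5: [0, 4],
}

# Mini-batch setting: batch size 1 (one seed), fan-out 2 at each of two layers.
mini = sample_computation_graph(adj, seeds=[0], fan_outs=[2, 2], rng=random.Random(0))

# Full-graph limit: every node is a seed and no neighbor is dropped.
full = sample_computation_graph(adj, seeds=list(adj), fan_outs=[None, None], rng=random.Random(0))
```

In this sketch the mini-batch call touches only part of the graph, while the full-graph call touches every node, which is where the memory cost of full-graph training comes from.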
If this is right
- Batch size and fan-out size produce non-isotropic effects, so they must be tuned separately rather than increased together.
- Well-tuned mini-batches can achieve similar or better generalization without the memory cost of loading the full graph.
- Under resource constraints, practitioners gain concrete guidance on selecting these sizes to balance performance and efficiency.
- Convergence speed and final accuracy depend on graph structure through the fan-out mechanism in ways not captured by batch size alone.
Where Pith is reading between the lines
- The non-isotropic effects may appear in other sampling-based graph methods that control neighborhood size.
- Adaptive systems could adjust fan-out dynamically during training based on observed distribution shift.
- Tests on much larger or time-varying graphs would help check whether the findings hold beyond the evaluated datasets.
Load-bearing premise
The Wasserstein distance analysis correctly measures how fan-out size creates distribution shifts during sampling, and the chosen graphs and models reflect behavior in other cases.
What would settle it
Training the same GNN on a new large graph where full-graph training produces strictly higher accuracy and lower total time than any mini-batch setting with adjusted fan-out sizes would contradict the central claim.
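The falsification condition above is a strict dominance test, which can be sketched as follows. The record layout (`accuracy`, `time_s`) and the function name are hypothetical, and the numbers are placeholders for illustration only, not results from the paper.

```python
def full_graph_dominates(full_run, minibatch_runs):
    """True if the full-graph run beats EVERY mini-batch run on both criteria:
    strictly higher accuracy and strictly lower total training time."""
    if not minibatch_runs:
        raise ValueError("need at least one mini-batch run to compare against")
    return all(
        full_run["accuracy"] > run["accuracy"] and full_run["time_s"] < run["time_s"]
        for run in minibatch_runs
    )

# Placeholder numbers for illustration only; they are NOT results from the paper.
full_run = {"accuracy": 0.70, "time_s": 100.0}
minibatch_runs = [
    {"batch_size": 1024, "fan_out": 10, "accuracy": 0.71, "time_s": 80.0},
    {"batch_size": 4096, "fan_out": 5, "accuracy": 0.68, "time_s": 60.0},
]

# Only a True result on a new large graph would contradict the central claim.
contradicts_claim = full_graph_dominates(full_run, minibatch_runs)
```

A single mini-batch configuration that matches or beats the full-graph run on either criterion is enough to make the test return False, so the check is only meaningful when the mini-batch grid has been tuned over both batch size and fan-out.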
Original abstract
Full-graph and mini-batch Graph Neural Network (GNN) training approaches have distinct system design demands, making it crucial to choose the appropriate approach to develop. A core challenge in comparing these two GNN training approaches lies in characterizing their model performance (i.e., convergence and generalization) and computational efficiency. While batch size has been an effective lens for analyzing such behaviors in deep neural networks (DNNs), GNNs extend this lens by introducing a fan-out size, as full-graph training can be viewed as mini-batch training with the largest possible batch size and fan-out size. However, the impact of the batch and fan-out size for GNNs remains insufficiently explored. To this end, this paper systematically compares full-graph vs. mini-batch training of GNNs through empirical and theoretical analyses from the viewpoints of the batch size and fan-out size. Our key contributions include: 1) We provide a novel generalization analysis using the Wasserstein distance to study the impact of the graph structure, especially the fan-out size. 2) We uncover the non-isotropic effects of the batch size and the fan-out size in GNN convergence and generalization, providing practical guidance for tuning these hyperparameters under resource constraints. Finally, full-graph training does not always yield better model performance or computational efficiency than well-tuned smaller mini-batch settings. The implementation is available on GitHub: https://github.com/LIUMENGFAN-gif/GNN_fullgraph_minibatch_training.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript compares full-graph and mini-batch GNN training through the lens of batch size and fan-out size. It presents empirical results across multiple graph datasets and GNN architectures, along with a theoretical generalization bound that employs Wasserstein distance to quantify distribution shift induced by neighbor sampling at varying fan-out sizes. The central claim is that full-graph training (largest batch and fan-out) does not always outperform well-tuned smaller mini-batch settings in convergence, generalization, or computational efficiency, and that batch size and fan-out size exhibit non-isotropic effects.
Significance. If the empirical ordering holds under tighter controls, the work supplies actionable hyperparameter guidance for resource-constrained GNN training. The identification of non-isotropic effects is a useful empirical observation. The theoretical bound, however, would need to be shown predictive of the observed gaps to elevate the contribution beyond post-hoc interpretation of convergence curves.
major comments (2)
- [Theoretical Analysis] Theoretical section on Wasserstein generalization bound: no explicit tightness check (e.g., numerical comparison of bound value versus measured generalization gap) is reported for different fan-out sizes. Without this verification the bound remains a loose upper bound and does not yet causally ground the claim that fan-out-induced shift explains the non-isotropic performance ordering.
- [Experiments] Empirical results section: the experiments demonstrate cases where smaller fan-out wins, yet the analysis does not isolate whether optimization dynamics or graph heterogeneity dominate over the sampling-induced shift. Adding controlled ablations that hold optimizer and graph statistics fixed would be required to support the causal interpretation advanced in the abstract.
minor comments (2)
- [Introduction] Introduction: the statement that full-graph training corresponds to the largest possible batch and fan-out size should be accompanied by a short formal definition of fan-out size to avoid ambiguity for readers unfamiliar with neighbor sampling.
- [Figures] Figures: convergence plots should include error bars or multiple random seeds and should explicitly annotate the exact batch-size / fan-out-size pairs used in each curve.
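The formal definition requested in the first minor comment could take roughly the following form. This is one common formalization of node-wise neighbor sampling, stated here as an assumption; the symbols $b$, $\beta_\ell$, $\mathcal{N}(v)$, and $\tilde{\mathcal{N}}_\ell(v)$ are illustrative and are not taken from the paper.

```latex
% Assumed notation (not from the paper): graph G = (V, E), training nodes V_train,
% neighborhood N(v), an L-layer GNN, batch size b, per-layer fan-out beta_l.
\begin{align*}
  \mathcal{B} &\subseteq V_{\mathrm{train}}, \qquad |\mathcal{B}| = b, \\
  \tilde{\mathcal{N}}_\ell(v) &\subseteq \mathcal{N}(v), \qquad
  |\tilde{\mathcal{N}}_\ell(v)| = \min\bigl(\beta_\ell,\, |\mathcal{N}(v)|\bigr),
  \qquad \ell = 1, \dots, L.
\end{align*}
% Full-graph limit: every training node in one batch and no neighbor dropped.
\[
  b = |V_{\mathrm{train}}|, \quad
  \beta_\ell \ge \max_{v \in V} |\mathcal{N}(v)|
  \;\Longrightarrow\;
  \mathcal{B} = V_{\mathrm{train}}, \quad
  \tilde{\mathcal{N}}_\ell(v) = \mathcal{N}(v) \ \text{for all } v, \ell.
\]
```

Under this reading, batch size controls how many target nodes contribute to each gradient step, while fan-out controls how faithfully each target node's neighborhood is represented.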
Simulated Author's Rebuttal
We are grateful to the referee for the thorough review and valuable suggestions. We address the major comments point by point below, outlining the revisions we intend to make to strengthen the manuscript.
- Referee: [Theoretical Analysis] Theoretical section on Wasserstein generalization bound: no explicit tightness check (e.g., numerical comparison of bound value versus measured generalization gap) is reported for different fan-out sizes. Without this verification the bound remains a loose upper bound and does not yet causally ground the claim that fan-out-induced shift explains the non-isotropic performance ordering.
  Authors: We concur that an explicit check on the tightness of the Wasserstein generalization bound is necessary to better link it to the empirical observations. Accordingly, we will augment the theoretical section with numerical comparisons between the bound values and the measured generalization gaps for different fan-out sizes across the datasets. This will provide evidence on whether the bound is predictive of the performance variations induced by sampling.
  Revision: yes
- Referee: [Experiments] Empirical results section: the experiments demonstrate cases where smaller fan-out wins, yet the analysis does not isolate whether optimization dynamics or graph heterogeneity dominate over the sampling-induced shift. Adding controlled ablations that hold optimizer and graph statistics fixed would be required to support the causal interpretation advanced in the abstract.
  Authors: We appreciate this observation. Our current experimental setup fixes the optimizer and learning rate schedule for all configurations to focus on the effects of batch and fan-out sizes. To further isolate the impact of sampling-induced distribution shift from graph heterogeneity and optimization dynamics, we will incorporate additional ablation studies in the revised manuscript. These ablations will involve holding graph statistics constant where possible, for instance through the use of graph generators or by analyzing performance on subgraphs with matched properties.
  Revision: yes
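One ingredient of the numerical comparison promised in the first response is an empirical estimate of the shift that sampling introduces at each fan-out. The sketch below is a hedged illustration of that idea and is not the paper's bound: it assumes scalar node features, mean aggregation, a synthetic random graph, and the one-dimensional Wasserstein-1 distance, which for two equal-size samples equals the mean absolute difference of their sorted values. All names are hypothetical.

```python
import random

def wasserstein_1d(xs, ys):
    """Wasserstein-1 distance between two equal-size 1-D empirical samples."""
    if len(xs) != len(ys):
        raise ValueError("this shortcut requires equal sample sizes")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

def mean_aggregate(features, neighbors):
    """Mean of the scalar features of the given neighbors."""
    return sum(features[u] for u in neighbors) / len(neighbors)

def shift_at_fan_out(adj, features, fan_out, rng):
    """Distance between full-neighborhood and sampled-neighborhood aggregates."""
    full = [mean_aggregate(features, adj[v]) for v in adj]
    sampled = []
    for v in adj:
        k = min(fan_out, len(adj[v]))
        sampled.append(mean_aggregate(features, rng.sample(adj[v], k)))
    return wasserstein_1d(full, sampled)

# Synthetic setup (an assumption, not one of the paper's datasets):
# 200 nodes, 20 random out-neighbors each, standard normal scalar features.
rng = random.Random(0)
num_nodes, degree = 200, 20
adj = {v: rng.sample(range(num_nodes), degree) for v in range(num_nodes)}
features = [rng.gauss(0.0, 1.0) for _ in range(num_nodes)]

# Shift for a small, a medium, and the maximal fan-out.
shifts = {k: shift_at_fan_out(adj, features, k, rng) for k in (1, 5, degree)}
```

When the fan-out reaches the node degree, the sampled and full aggregates coincide up to floating-point rounding, so the estimated shift vanishes. A tightness check along the lines the authors describe would then set the bound's value and the measured generalization gap side by side with such estimates for each fan-out.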
Circularity Check
No significant circularity; derivation relies on external Wasserstein metric and independent experiments.
The paper's generalization analysis invokes the Wasserstein distance as an external tool to bound distribution shift from neighbor sampling and fan-out size, without defining the bound or the target performance in terms of each other. Empirical results on convergence and generalization across batch/fan-out regimes are obtained from direct training runs on standard graph datasets and GNN models, rather than by fitting parameters to a subset and relabeling them as predictions. No self-citation chains, uniqueness theorems imported from prior author work, or ansatzes smuggled via citation are load-bearing for the central claim that full-graph training does not always dominate well-tuned mini-batch settings. The derivation chain therefore remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (2)
- batch sizes tested
- fan-out sizes tested
axioms (2)
- [domain assumption] Wasserstein distance bounds the generalization gap induced by graph sampling
- [domain assumption] GNN convergence behaves similarly to DNNs when batch size varies