
arxiv: 2601.22678 · v3 · pith:X5JBXBVXnew · submitted 2026-01-30 · 💻 cs.LG

Full-Graph vs. Mini-Batch Training: Comprehensive Analysis from a Batch Size and Fan-Out Size Perspective

Pith reviewed 2026-05-21 13:52 UTC · model grok-4.3

classification 💻 cs.LG
keywords graph neural networks · full-graph training · mini-batch training · batch size · fan-out size · generalization bound · sampling effects

The pith

Full-graph GNN training does not always outperform well-tuned mini-batch settings in accuracy or efficiency.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper compares full-graph and mini-batch training for graph neural networks by viewing full-graph training as the extreme case of the largest possible batch size and fan-out size. It combines experiments with a new theoretical bound to show that simply using larger sizes does not reliably improve convergence, generalization, or computational speed. The analysis highlights uneven effects from batch size versus fan-out size, especially how smaller fan-out values shift the sampled data distribution. Results indicate that resource-limited settings can often reach comparable or superior outcomes with smaller, carefully chosen mini-batches instead of loading the entire graph.

Core claim

The paper establishes that full-graph training, viewed as the extreme case of maximum batch size and fan-out size, does not consistently provide superior model performance or computational efficiency compared to carefully tuned smaller mini-batch configurations in GNNs.
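The "extreme case" framing can be made concrete with a minimal Python sketch of layer-wise neighbor sampling. This is an illustration only, not the authors' implementation: the function `sample_block`, the toy adjacency dictionary, and the convention that a fan-out of `None` means "keep every neighbor" are all assumptions introduced here.

```python
import random

def sample_block(adj, seeds, fan_outs, rng):
    """Sample a multi-hop neighborhood around `seeds` (hypothetical helper).

    adj      : dict mapping each node to a list of its neighbors
    seeds    : the nodes in the batch (batch size = len(seeds))
    fan_outs : one neighbor cap per GNN layer; None keeps all neighbors
    Returns one set of sampled (neighbor, target) edges per layer.
    """
    layers = []
    frontier = set(seeds)
    for fan_out in fan_outs:
        edges = set()
        for v in sorted(frontier):
            nbrs = adj[v]
            # Subsample only when the node has more neighbors than the cap.
            if fan_out is not None and len(nbrs) > fan_out:
                nbrs = rng.sample(nbrs, fan_out)
            edges.update((u, v) for u in nbrs)
        layers.append(edges)
        # The next hop expands from the targets plus their sampled neighbors.
        frontier = frontier | {u for u, _ in edges}
    return layers

# Toy graph, made up for illustration.
adj = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1, 3], 3: [0, 2]}

# Mini-batch: one seed node, at most two neighbors per node per layer.
mini = sample_block(adj, [0], [2, 2], random.Random(0))

# "Full-graph": every node is a seed and no neighbor is dropped.
full = sample_block(adj, list(adj), [None, None], random.Random(0))

print([len(e) for e in mini], [len(e) for e in full])
```

With all nodes as seeds and no cap, every layer contains every edge of the graph, which is the sense in which full-graph training sits at the largest batch size and fan-out size. Shrinking either knob independently removes different things: a smaller batch removes target nodes, while a smaller fan-out removes edges feeding the targets that remain.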

What carries the argument

Batch size and fan-out size as parameters that scale training in GNNs, analyzed through a Wasserstein distance generalization bound that quantifies distribution shift from sampling.
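The paper's bound itself is not reproduced on this page, so the following Python sketch only illustrates the underlying idea: subsampling neighbors changes the distribution of aggregated features, and a Wasserstein distance can quantify that change. Everything here is an assumption made for illustration, including the scalar node features, the mean aggregator, the random toy graph, and the helper names `wasserstein_1d` and `mean_aggregate`. The distance uses the standard fact that, for two equal-size samples on the real line, the 1-Wasserstein distance equals the mean absolute difference of the sorted values.

```python
import random

def wasserstein_1d(xs, ys):
    """W1 between two equal-size empirical samples on the real line."""
    if len(xs) != len(ys):
        raise ValueError("samples must have the same size")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

def mean_aggregate(adj, feat, fan_out, rng):
    """Mean of neighbor features per node, with an optional fan-out cap."""
    out = []
    for v in sorted(adj):
        nbrs = adj[v]
        if fan_out is not None and len(nbrs) > fan_out:
            nbrs = rng.sample(nbrs, fan_out)
        out.append(sum(feat[u] for u in nbrs) / len(nbrs))
    return out

# Random toy graph and features, made up for illustration.
rng = random.Random(0)
n = 200
adj = {
    v: rng.sample([u for u in range(n) if u != v], rng.randint(1, 20))
    for v in range(n)
}
feat = {v: rng.gauss(0.0, 1.0) for v in range(n)}

full = mean_aggregate(adj, feat, None, rng)
max_deg = max(len(ns) for ns in adj.values())
for fan_out in (1, 2, 5, max_deg):
    sampled = mean_aggregate(adj, feat, fan_out, rng)
    print(fan_out, wasserstein_1d(full, sampled))
```

Once the fan-out reaches the maximum degree, no neighbor is dropped and the distance is exactly zero. How quickly the distance shrinks before that point depends on the degree distribution, which is one way to read the claim that graph structure enters through fan-out rather than through batch size.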

If this is right

  • Batch size and fan-out size produce non-isotropic effects, so they must be tuned separately rather than increased together.
  • Well-tuned mini-batches can achieve similar or better generalization without the memory cost of loading the full graph.
  • Under resource constraints, practitioners gain concrete guidance on selecting these sizes to balance performance and efficiency.
  • Convergence speed and final accuracy depend on graph structure through the fan-out mechanism in ways not captured by batch size alone.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The non-isotropic effects may appear in other sampling-based graph methods that control neighborhood size.
  • Adaptive systems could adjust fan-out dynamically during training based on observed distribution shift.
  • Tests on much larger or time-varying graphs would help check whether the findings hold beyond the evaluated datasets.

Load-bearing premise

The Wasserstein distance analysis correctly measures how fan-out size creates distribution shifts during sampling, and the chosen graphs and models reflect behavior in other cases.

What would settle it

Training the same GNN on a new large graph where full-graph training produces strictly higher accuracy and lower total time than any mini-batch setting with adjusted fan-out sizes would contradict the central claim.
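Stated as a check, the falsification condition is strict dominance on both axes against every mini-batch configuration tried. The sketch below is a hypothetical helper, not part of the paper's code; the record layout (`accuracy`, `time_s`) is an assumption chosen for illustration.

```python
def full_graph_strictly_dominates(full_run, mini_runs):
    """True only if the full-graph run beats every mini-batch run
    on both accuracy (higher) and total training time (lower).

    full_run, and each item of mini_runs, is a dict with the
    hypothetical keys 'accuracy' and 'time_s'.
    """
    if not mini_runs:
        raise ValueError("need at least one mini-batch run to compare against")
    return all(
        full_run["accuracy"] > m["accuracy"] and full_run["time_s"] < m["time_s"]
        for m in mini_runs
    )
```

A single mini-batch setting that matches the full-graph run on accuracy, or finishes sooner, is enough to make the function return `False`, so the counterexample only counts if the sweep over batch sizes and fan-out sizes is reasonably thorough.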

Original abstract

Full-graph and mini-batch Graph Neural Network (GNN) training approaches have distinct system design demands, making it crucial to choose the appropriate approach to develop. A core challenge in comparing these two GNN training approaches lies in characterizing their model performance (i.e., convergence and generalization) and computational efficiency. While a batch size has been an effective lens in analyzing such behaviors in deep neural networks (DNNs), GNNs extend this lens by introducing a fan-out size, as full-graph training can be viewed as mini-batch training with the largest possible batch size and fan-out size. However, the impact of the batch and fan-out size for GNNs remains insufficiently explored. To this end, this paper systematically compares full-graph vs. mini-batch training of GNNs through empirical and theoretical analyses from the view points of the batch size and fan-out size. Our key contributions include: 1) We provide a novel generalization analysis using the Wasserstein distance to study the impact of the graph structure, especially the fan-out size. 2) We uncover the non-isotropic effects of the batch size and the fan-out size in GNN convergence and generalization, providing practical guidance for tuning these hyperparameters under resource constraints. Finally, full-graph training does not always yield better model performance or computational efficiency than well-tuned smaller mini-batch settings. The implementation can be found in the github link: https://github.com/LIUMENGFAN-gif/GNN_fullgraph_minibatch_training.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript compares full-graph and mini-batch GNN training through the lens of batch size and fan-out size. It presents empirical results across multiple graph datasets and GNN architectures, along with a theoretical generalization bound that employs Wasserstein distance to quantify distribution shift induced by neighbor sampling at varying fan-out sizes. The central claim is that full-graph training (largest batch and fan-out) does not always outperform well-tuned smaller mini-batch settings in convergence, generalization, or computational efficiency, and that batch size and fan-out size exhibit non-isotropic effects.

Significance. If the empirical ordering holds under tighter controls, the work supplies actionable hyperparameter guidance for resource-constrained GNN training. The identification of non-isotropic effects is a useful empirical observation. The theoretical bound, however, would need to be shown predictive of the observed gaps to elevate the contribution beyond post-hoc interpretation of convergence curves.

major comments (2)
  1. [Theoretical Analysis] Theoretical section on Wasserstein generalization bound: no explicit tightness check (e.g., numerical comparison of bound value versus measured generalization gap) is reported for different fan-out sizes. Without this verification the bound remains a loose upper bound and does not yet causally ground the claim that fan-out-induced shift explains the non-isotropic performance ordering.
  2. [Experiments] Empirical results section: the experiments demonstrate cases where smaller fan-out wins, yet the analysis does not isolate whether optimization dynamics or graph heterogeneity dominate over the sampling-induced shift. Adding controlled ablations that hold optimizer and graph statistics fixed would be required to support the causal interpretation advanced in the abstract.
minor comments (2)
  1. [Introduction] Introduction: the statement that full-graph training corresponds to the largest possible batch and fan-out size should be accompanied by a short formal definition of fan-out size to avoid ambiguity for readers unfamiliar with neighbor sampling.
  2. [Figures] Figures: convergence plots should include error bars or multiple random seeds and should explicitly annotate the exact batch-size / fan-out-size pairs used in each curve.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for the thorough review and valuable suggestions. We address the major comments point by point below, outlining the revisions we intend to make to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Theoretical Analysis] Theoretical section on Wasserstein generalization bound: no explicit tightness check (e.g., numerical comparison of bound value versus measured generalization gap) is reported for different fan-out sizes. Without this verification the bound remains a loose upper bound and does not yet causally ground the claim that fan-out-induced shift explains the non-isotropic performance ordering.

    Authors: We concur that an explicit check on the tightness of the Wasserstein generalization bound is necessary to better link it to the empirical observations. Accordingly, we will augment the theoretical section with numerical comparisons between the bound values and the measured generalization gaps for different fan-out sizes across the datasets. This will provide evidence on whether the bound is predictive of the performance variations induced by sampling.

    Revision: yes

  2. Referee: [Experiments] Empirical results section: the experiments demonstrate cases where smaller fan-out wins, yet the analysis does not isolate whether optimization dynamics or graph heterogeneity dominate over the sampling-induced shift. Adding controlled ablations that hold optimizer and graph statistics fixed would be required to support the causal interpretation advanced in the abstract.

    Authors: We appreciate this observation. Our current experimental setup fixes the optimizer and learning rate schedule for all configurations to focus on the effects of batch and fan-out sizes. To further isolate the impact of sampling-induced distribution shift from graph heterogeneity and optimization dynamics, we will incorporate additional ablation studies in the revised manuscript. These ablations will involve holding graph statistics constant where possible, for instance through the use of graph generators or by analyzing performance on subgraphs with matched properties.

    Revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation relies on external Wasserstein metric and independent experiments.

Full rationale

The paper's generalization analysis invokes the Wasserstein distance as an external tool to bound distribution shift from neighbor sampling and fan-out size, without defining the bound or the target performance in terms of each other. Empirical results on convergence and generalization across batch/fan-out regimes are obtained from direct training runs on standard graph datasets and GNN models, rather than by fitting parameters to a subset and relabeling them as predictions. No self-citation chains, uniqueness theorems imported from prior author work, or ansatzes smuggled via citation are load-bearing for the central claim that full-graph training does not always dominate well-tuned mini-batch settings. The derivation chain therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The paper relies on standard assumptions about GNN message passing and sampling; no new entities are postulated. Free parameters are the tested batch and fan-out values chosen for experiments.

free parameters (2)
  • batch sizes tested
    Specific values selected to span the range from full-graph training down to small mini-batches; they are free parameters of the experimental design, not fitted model parameters.
  • fan-out sizes tested
    Values chosen to study sampling neighborhood effects in the generalization bound.
axioms (2)
  • [domain assumption] Wasserstein distance bounds the generalization gap induced by graph sampling
    Invoked in the novel generalization analysis section to connect fan-out size to distribution shift.
  • [domain assumption] GNN convergence behaves similarly to DNNs when batch size varies
    Used as baseline to highlight non-isotropic effects with fan-out.

pith-pipeline@v0.9.0 · 5812 in / 1332 out tokens · 47969 ms · 2026-05-21T13:52:08.093302+00:00 · methodology
